1. Introduction and Motivation 
2. Mathematical Review 
3. Signal Spaces 
   1. Metric Spaces
   2. Convergent Sequences, Cauchy Sequences, and Complete Spaces
   3. Continuity and Convergence of Functions
   4. Vector Spaces
   5. Normed Spaces
   6. Topological Concepts
   7. Inner Product Spaces and Hilbert Spaces
   8. The $\ell_2$ Space
   9. Orthogonality
   10. Gram-Schmidt Process
   11. Projection Theorem
   12. Least Squares Estimation in Hilbert Spaces
   13. The Hilbert Space of Random Variables
   14. Infinite-Dimensional Hilbert Spaces
4. Functionals and Operators
   1. Linear Functionals
   2. Dual Spaces
   3. Linear Operators
   4. Properties of Linear Operators
   5. Fundamental Subspaces of an Operator
   6. Adjoint Operators
   7. Matrix Representations of Linear Operators
   8. Eigendecomposition of Linear Operators
   9. The Karhunen-Loeve Transform
   10. Singular Value Decomposition
5. Optimization
   1. Optimization in Hilbert Spaces
   2. Local Optimization
   3. Calculus of Variations
   4. Constrained Optimization
   5. Second Order Conditions
   6. Constrained Optimization with Inequality Constraints
   7. Fenchel Duality


Introduction and Motivation 
Introduction and motivation for a graduate electrical engineering course on 
Signal Theory. 


Introduction and Motivation 


Most areas of electrical and computer engineering (beyond signal 
processing) deal with signals. Communications is about transmitting, 
receiving, and interpreting signals. Signals are used to probe and model 
systems in control and circuit design. The images acquired by radar systems 
and biomedical devices are signals that change in space and time, 
respectively. Signals are used in microelectronic devices to convey digital 
information or send instructions to processors. 


This course will provide a mathematical framework to handle signals and 
operations on signals. Some of the questions that will be answered in this 
course include: 


• What is a signal? How do we represent it?
• How do we represent operations on signals?
• What does it mean for signals to be similar to or different from each other?
• When is a candidate signal a good/bad approximation (i.e., a simplified version) of a target signal?
• When is a signal “interesting” or “boring”?
• How can we characterize groups of signals?
• How do we find the best approximation of a target signal in a group of candidates?


Course Overview 


Signal Theory 
The signal theory presented in this course has three main components: 


• Signal representations and signal spaces, which provide a framework to talk about sets of signals and to define signal approximations.
• Distances and norms to evaluate and compare signals. Norms provide a measure of strength, amplitude, or “interestingness” of a signal, and distances provide a measure of similarity between signals.
• Projection theory and signal estimation to work with signals that have been distorted, aiming to recover the best approximation in a defined set.


Operator Theory 


Operators are mathematical representations of systems that manipulate a 
signal. The operator theory presented in this course has three main 
components: 


• Operator properties that allow us to characterize their effect on signals in a simple fashion.
• Operator characterizations that allow us to model their effect on arbitrary inputs.
• Operator operations (no pun intended) that allow us to create new systems and reverse the effect of a system on a signal.


Optimization Theory 


Optimization is an area of applied mathematics that, in the context of our
course, will allow us to determine the best signal output for a given problem
(such as signal denoising or compression, codebook design, and radar pulse
shaping) according to defined metrics. The optimization theory presented in
this course has three main components:


• Optimization guarantees that rely on properties of the metrics and signal sets we search over to formally ensure that the optimal signal can be found.
• Unconstrained optimization, where we search for the optimum over an entire signal space.
• Constrained optimization, where the optimal signal must meet additional specific requirements.


Example 


As an example, consider the following communications channel: 


x → [Transmitter F] → [Channel H] → (+) → [Receiver G] → x̂ (decoded message)
                                     ↑
                            Noise n ~ N(0, σ²)

Block diagram for a communications channel


A mathematical formulation of this channel requires us to establish:

• which signals $x$ can be input into the transmitter;
• how the transmitter $F$, the channel $H$, and the receiver $G$ are characterized;
• how the concatenation of the blocks $F$ and $H$ is expressed;
• how the noise addition operation is formulated;
• how we measure whether the decoded message $\hat{x}$ is a good approximation of the input $x$;
• how the receiver $G$ is designed to be optimal for all the choices above.


For this example, by the end of the course, you will be able to solve the
problem of selecting the transmitter/receiver pair $F, G$ that minimizes the
power of the error $e = \hat{x} - x$ while meeting maximum transmission power
constraints

$$\frac{\mathrm{power}(F(x))}{\mathrm{power}(x)} < P_{\max}.$$


Mathematical Review 
Review of background mathematical concepts for the Signal Theory course. 


Basic Set Theory 


We begin with a quick review of basic set theory from undergraduate 
courses. 


Definition 1 A set is an unordered collection of objects denoted by a capital
letter $A$ and written explicitly by listing its elements $A = \{a_1, a_2, \ldots\}$.

Definition 2 The union of two sets $A$ and $B$ is denoted by
$A \cup B := \{x : x \in A \vee x \in B\}$. The intersection of two sets $A$ and $B$ is
denoted by $A \cap B := \{x : x \in A \wedge x \in B\}$.

Definition 3 A set $A$ is contained in another set $B$, denoted $A \subseteq B$, if
$x \in A \Rightarrow x \in B$. Two sets $A$ and $B$ are equal if $x \in A \Leftrightarrow x \in B$.

Definition 4 The complement of $A$ is the set $A^C := \{x : x \notin A\}$. The empty
set is denoted by $\emptyset := \{\}$.


Definition 5 A set is finite if it has a finite number of elements. A set is
countably infinite if there is a one-to-one correspondence between its elements
and the integers $\mathbb{Z}$. A set is uncountably infinite if it is neither
finite nor countably infinite.


Convexity 


Definition 6 A set $A \subseteq X$ is convex if for all $x, y \in A$ all convex
combinations of $x$ and $y$ are in $A$, i.e., for all $0 \le \alpha \le 1$ we have
$\alpha x + (1 - \alpha) y \in A$.

Example 1 The figure below shows that the line segment $\overline{xy}$ (containing
all convex combinations of $x$ and $y$) is included in $A$; since this is true for
each $x, y \in A$, the set $A$ is convex. Conversely, for the set $B$ we can find
two points $u, v \in B$ such that the line segment $\overline{uv}$ is not completely
contained in $B$; therefore, $B$ is not convex.


Examples of convex and nonconvex regions. 


Fact 1 If $A$ is convex then $\beta A = \{\beta x : x \in A\}$ is convex for $\beta > 0$.

Definition 7 The convex hull of a set $A \subseteq X$ is the smallest convex set $S$
such that $A \subseteq S$.

Example 2 Consider the set $A$ in the figure below, which is not convex. By
adding to $A$ all points that are convex combinations of elements of $A$ but
not in $A$ (i.e., the points in the shaded region), we obtain the convex hull of
$A$.


Examples of convex and nonconvex regions. 


A set with a single element is always convex. 
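
The definition of convexity can be probed numerically. The following is a minimal sketch, not a proof: it assumes two hypothetical sets in the plane (a unit disc and a ring-shaped region, neither taken from the figures above) and searches sampled points for a convex combination that leaves the set.

```python
import random

def in_disc(p, radius=1.0):
    """Membership test for the closed disc centered at the origin (a convex set)."""
    return p[0] ** 2 + p[1] ** 2 <= radius ** 2 + 1e-12

def in_ring(p, inner=0.5, outer=1.0):
    """Membership test for a disc with a hole in the middle (not a convex set)."""
    r2 = p[0] ** 2 + p[1] ** 2
    return inner ** 2 <= r2 <= outer ** 2

def convex_combination(x, y, alpha):
    """Return alpha * x + (1 - alpha) * y for two points in the plane."""
    return (alpha * x[0] + (1 - alpha) * y[0],
            alpha * x[1] + (1 - alpha) * y[1])

def find_violation(member, samples, steps=20):
    """Look for x, y in the set and alpha in [0, 1] whose combination leaves the set."""
    points = [p for p in samples if member(p)]
    for x in points:
        for y in points:
            for k in range(steps + 1):
                alpha = k / steps
                if not member(convex_combination(x, y, alpha)):
                    return x, y, alpha
    return None  # no counterexample found among the samples

random.seed(0)
samples = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(60)]
samples += [(0.9, 0.0), (-0.9, 0.0)]  # two ring points whose midpoint is the origin

print("disc counterexample:", find_violation(in_disc, samples))
print("ring counterexample:", find_violation(in_ring, samples))
```

Finding no counterexample among finitely many samples does not establish convexity; finding one does establish that a set is not convex.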


Metric Spaces 
Description of signal spaces and metric spaces. 


Signal Spaces 


We start the content of our course by defining its main concepts of a signal 
and a signal space. 


Definition 1 A signal is the value of some quantity as a function of time,
space, frequency, etc.; each signal is labeled by a lower-case letter $x$.

Definition 2 A signal space is a set of signals defined by some criterion,
labeled by an upper-case letter $X$ (since it is a set).

Some familiar sets of signals are $X = \mathbb{R}$, $X = \mathbb{C}$, and the set of vectors
$X = \mathbb{R}^n$.

Definition 3 The signal space $L_2[a, b]$ contains all signals $x(t)$ such that
$x(t) = 0$ for all $t < a$ or $t > b$ and $\int_a^b |x(t)|^2 dt < \infty$ (i.e., the
signal has finite energy).


Metric Spaces 


Definition 4 A metric $d : X \times X \to \mathbb{R}$ is a function used to measure
the distance between pairs of elements of $X$ with the following properties: for
all $x, y, z \in X$,

1. $d(x, y) = d(y, x)$ (symmetry),
2. $d(x, y) \ge 0$ (non-negativity),
3. $d(x, y) = 0 \Leftrightarrow x = y$,
4. $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality).


If d is a metric on X, the pair (X, d) is called a metric space. A set X can 
have multiple metrics, leading to different metric spaces. 


Example 1 The following are some initial examples of metric spaces: 


• $X = \mathbb{R}$ with $d_0(x, y) = |x - y|$ for all $x, y \in \mathbb{R}$: it is easy to check
properties (1-4).

• $X = \mathbb{R}$ with $d(x, y) = \begin{cases} 1 & \text{if } x \neq y, \\ 0 & \text{if } x = y, \end{cases}$ for all $x, y \in \mathbb{R}$: it is easy to
check properties (1-3). To verify (4), first assume that
$d(x, y) + d(y, z) = 0$; then both $d(x, y) = 0$ and $d(y, z) = 0$, which
means $x = y$ and $y = z$; by transitivity, $x = z$ and
$d(x, z) = 0 \le d(x, y) + d(y, z)$, as desired. Now assume that
$d(x, y) + d(y, z) \ge 1$; since $d(x, z) \le 1$ always, we immediately get
$d(x, z) \le d(x, y) + d(y, z)$, as desired.

• $X = \mathbb{R}^n$ with metric $d_2(x, y) = \left(\sum_{i=1}^n |x_i - y_i|^2\right)^{1/2}$, known as
the Euclidean metric.
e X = R” with metric d? (z, y) := = ee ou 


• $X = L_2[a, b]$ with metric $d_2(x, y) := \left(\int_a^b |x(t) - y(t)|^2 dt\right)^{1/2}$.
Formally, $d_2$ is not a metric on $L_2[a, b]$ but only a pseudometric, since there are signals $x \neq y$ that
yield $d_2(x, y) = 0$. However, one can define a new signal space
where all signals with $d_2(x, y) = 0$ are equal to each other.

• $X = L_p[a, b]$ with metric $d_p(x, y) := \left(\int_a^b |x(t) - y(t)|^p dt\right)^{1/p}$, for
$1 \le p < \infty$, an extension of the metric $d_2$.

• $X = L_\infty[a, b]$ with metric $d_\infty(x, y) := \sup_{t \in [a, b]} |x(t) - y(t)|$; this
metric solves the equivalence problem of $d_2$.

Here, $\sup(A)$ is the supremum of $A$, i.e., the smallest value $x_0 \in \mathbb{R}$ such
that $x \le x_0$ $\forall x \in A$. Similarly, $\inf(A)$ is the infimum of $A$, i.e., the
largest value $x_0$ such that $x_0 \le x$ $\forall x \in A$.
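
The four metric properties can be spot-checked on finite sets of points. The sketch below is only an illustration under assumptions: the helper `satisfies_metric_properties` and the squared-difference function are hypothetical names introduced here, and passing the check on samples does not prove that a function is a metric.

```python
import itertools
import math
import random

def d0(x, y):
    """Absolute-difference metric on the reals."""
    return abs(x - y)

def d_discrete(x, y):
    """Metric that equals 1 for distinct points and 0 otherwise."""
    return 0.0 if x == y else 1.0

def d2(x, y):
    """Euclidean metric on vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_difference(x, y):
    """Not a metric: included as a counterexample to the triangle inequality."""
    return (x - y) ** 2

def satisfies_metric_properties(d, points, tol=1e-12):
    """Check properties (1-4) of a metric on every pair and triple of sample points."""
    for x, y in itertools.product(points, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:       # (1) symmetry
            return False
        if d(x, y) < 0:                        # (2) non-negativity
            return False
        if (d(x, y) == 0) != (x == y):         # (3) zero distance iff equal
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # (4) triangle inequality
            return False
    return True

rng = random.Random(1)
reals = [rng.uniform(-5, 5) for _ in range(12)]
vectors = [tuple(rng.uniform(-5, 5) for _ in range(3)) for _ in range(12)]

print(satisfies_metric_properties(d0, reals))
print(satisfies_metric_properties(d_discrete, reals))
print(satisfies_metric_properties(d2, vectors))
print(satisfies_metric_properties(squared_difference, [0.0, 1.0, 2.0]))
```

The squared difference fails because $(0 - 2)^2 = 4 > (0 - 1)^2 + (1 - 2)^2 = 2$.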


Convergent Sequences, Cauchy Sequences, and Complete Spaces 
Description of convergent sequences and Cauchy sequences in metric spaces. Description of complete spaces. 


Convergence 


The concept of convergence evaluates whether a sequence of elements is getting “closer” to a given point or 
not. 


Definition 1 Assume a metric space $(X, d)$ and a countably infinite sequence of elements
$\{x_n\} := \{x_n, n = 1, 2, 3, \ldots\} \subseteq X$. The sequence $\{x_n\}$ is said to converge to $x \in X$ if for any $\epsilon > 0$ there
exists an integer $n_0 \in \mathbb{Z}^+$ such that $d(x, x_n) < \epsilon$ for all $n \ge n_0$. A convergent sequence can be denoted as
$\lim_{n \to \infty} x_n = x$ or $x_n \xrightarrow{n \to \infty} x$.

Illustration of a convergent sequence $\{x_i\}$.

Note that in the definition $n_0$ is implicitly dependent on $\epsilon$, and therefore is sometimes written as $n_0(\epsilon)$. Note
also that the convergence of a sequence depends on both the space $X$ and the metric $d$: a sequence that is
convergent in one space may not be convergent in another, and a sequence that is convergent under some
metric may not be convergent under another. Finally, one can abbreviate the notation of convergence to
$x_n \to x$ when the index variable $n$ is obvious.

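Example 1 In the metric space $(\mathbb{R}, d_0)$ where $d_0(x, y) = |x - y|$, the sequence $x_n = 1/n$ gives $x_n \xrightarrow{n \to \infty} 0$:
fix $\epsilon$ and let $n_0 > \lceil 1/\epsilon \rceil$ (where $\lceil 1/\epsilon \rceil$ is the smallest integer that is no smaller than $1/\epsilon$). If $n \ge n_0$ then
Equation:

$$d_0(0, x_n) = \left|0 - \frac{1}{n}\right| = \frac{1}{n} \le \frac{1}{n_0} < \epsilon,$$

verifying the definition. So by setting $n_0(\epsilon) > \lceil 1/\epsilon \rceil$, we have shown that $\{x_n\}$ is a convergent sequence.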
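
The choice of $n_0(\epsilon)$ in this example can be checked numerically. This is a small sketch under assumptions: the function names are illustrative, and only a finite stretch of the tail is tested, so it supports the argument rather than replacing it.

```python
import math

def n0_for(eps):
    """Return an index n0 > ceil(1/eps), matching the choice made in the example."""
    return math.ceil(1 / eps) + 1

def tail_within_eps(n0, eps, horizon=10_000):
    """Check d0(0, 1/n) < eps for n = n0, ..., n0 + horizon - 1."""
    return all(abs(1 / n - 0) < eps for n in range(n0, n0 + horizon))

for eps in (0.5, 0.1, 0.003):
    n0 = n0_for(eps)
    print(eps, n0, tail_within_eps(n0, eps))
```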
Example 2 Here are some examples of non-convergent sequences in $(\mathbb{R}, d_0)$:

• $x_n = n^2$ diverges as $n \to \infty$, as it grows without bound.

• $x_n = 1 + (-1)^n$ (i.e., the sequence $\{x_n\} = \{0, 2, 0, 2, \ldots\}$) diverges since for $\epsilon < 1$ there does not exist
an $n_0$ that satisfies the definition for any choice of limit $x$. More explicitly, assume that a limit $x$ exists. If
$x \notin (0, 2)$ then for any $\epsilon < 2$ one sees that for either even or odd values of $n$ we have $d(x, x_n) > \epsilon$, and
so no $n_0$ satisfies the definition. If $x \in (0, 2)$ then select $\epsilon = \frac{1}{2} \min(x, 2 - x)$. We will have that
$d(x, x_n) > \epsilon$ for all $n$, and so no $n_0$ can satisfy the definition. Thus, the sequence does not converge.

Theorem 1 If a sequence converges, then its limit is unique.


Proof: Assume for the sake of contradiction that $x_n \to x$ and $x_n \to y$, with $x \neq y$. Pick an arbitrary $\epsilon > 0$,
and so for the two limits we must be able to find $n_0$ and $n_0'$, respectively, such that $d(x, x_n) < \epsilon/2$ if $n \ge n_0$
and $d(y, x_n) < \epsilon/2$ if $n \ge n_0'$. Pick $n^* > \max(n_0, n_0')$; using the triangle inequality, we get that
$d(x, y) \le d(x, x_{n^*}) + d(x_{n^*}, y) < \epsilon/2 + \epsilon/2 = \epsilon$. Since for each $\epsilon$ we can find such an $n^*$, it follows that
$d(x, y) < \epsilon$ for all $\epsilon > 0$. Thus, we must have $d(x, y) = 0$ and $x = y$, which contradicts $x \neq y$; so the two limits are the same and
the limit must be unique.


Cauchy Sequences 


The concept of a Cauchy sequence is more subtle than that of a convergent sequence: the elements of the
sequence must become arbitrarily close to one another as the sequence progresses, without any reference to a limit point.

Definition 2 A sequence $\{x_n\}$ is a Cauchy sequence if for any $\epsilon > 0$ there exists an $n_0 \in \mathbb{Z}^+$ such that for all
$j, k \ge n_0$ we have $d(x_j, x_k) < \epsilon$.

As before, the choice of $n_0$ depends on $\epsilon$, and whether a sequence is Cauchy depends on the metric space
$(X, d)$. That being said, there is a connection between Cauchy sequences and convergent sequences.


Theorem 2 Every convergent sequence is a Cauchy sequence. 


Proof: Assume that a sequence $x_n \to x$ in $(X, d)$. Fix $\epsilon > 0$. Since $\{x_n\}$ is convergent, there must exist an
$n_0 \in \mathbb{Z}^+$ such that $d(x_n, x) < \epsilon/2$ for all $n \ge n_0$. Now, pick $j, k \ge n_0$. Then, using the triangle inequality,
we have $d(x_j, x_k) \le d(x_j, x) + d(x, x_k) < \epsilon/2 + \epsilon/2$ (since both $j$ and $k$ are greater than or equal to $n_0$) and
so $d(x_j, x_k) < \epsilon$. Therefore, the sequence $\{x_n\}$ is Cauchy.


One may wonder if the opposite is true: is every Cauchy sequence a convergent sequence? 


Example 3 Focus on the metric space $(X, d_0)$ with $X = (0, 1] = \{x \in \mathbb{R} : 0 < x \le 1\}$. Now consider the
sequence $x_n = 1/n$ in $X$. This is a Cauchy sequence: one can show this by picking $n_0(\epsilon) > \lceil 2/\epsilon \rceil$ and using
the triangle inequality to get, for all $j, k \ge n_0$,
Equation:

$$d_0(x_j, x_k) \le d_0(x_j, 0) + d_0(x_k, 0) = |x_j| + |x_k| = \frac{1}{j} + \frac{1}{k} \le \frac{2}{n_0} < \epsilon.$$

However, this is not a convergent sequence in $X$: we've shown earlier that $x_n \to 0$ in $\mathbb{R}$. Since $0 \notin (0, 1]$ and a sequence
has a unique limit, there is no $x \in (0, 1]$ such that $x_n \to x$.
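
As a quick numerical companion to this example, the sketch below (with illustrative function names, and testing only a finite window of the tail) checks that the largest pairwise gap among terms beyond $n_0(\epsilon)$ stays below $\epsilon$, while every term remains inside $(0, 1]$ and so never equals the would-be limit 0.

```python
import math

def n0_for(eps):
    """Return an index n0 > ceil(2/eps), matching the choice made in the example."""
    return math.ceil(2 / eps) + 1

def max_pairwise_gap(n0, span=500):
    """Largest |x_j - x_k| over j, k in [n0, n0 + span) for x_n = 1/n."""
    terms = [1 / n for n in range(n0, n0 + span)]
    return max(terms) - min(terms)

for eps in (0.5, 0.05, 0.002):
    print(eps, n0_for(eps), max_pairwise_gap(n0_for(eps)) < eps)

# Every term lies in (0, 1]; the point 0 is approached but never reached.
print(all(0 < 1 / n <= 1 for n in range(1, 10_000)))
```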


Complete Metric Spaces 
Whether Cauchy sequences converge or not underlies the concept of completeness of a space. 


Definition 4 A complete metric space is a metric space in which all Cauchy sequences are convergent 
sequences. 


Example 5 The metric space $(X, d_0)$ with $X = (0, 1]$ from earlier is not complete, since we found a Cauchy
sequence that converges to a point outside of $X$. We can make it complete by adding the convergence point,
i.e., if $X' = [0, 1]$ then $(X', d_0)$ is a complete metric space.

Example 6 Let $C[T]$ denote the space of all continuous functions with support $T$. If we pick the metric
$d_2(x, y) = \left(\int_T |x(t) - y(t)|^2 dt\right)^{1/2}$, then $(C[T], d_2)$ is a metric space; however, it is not a complete metric
space.


To show that the space is not complete, all we have to do is find a Cauchy sequence of signals within $C[T]$
that does not converge to any signal in $C[T]$. For simplicity, fix $T = [-1, 1]$. Fortuitously, we find the
sequence illustrated in the figure below, which can be written as
Equation:

$$x_n(t) = \begin{cases} -1 & \text{if } t < -1/n, \\ 1 & \text{if } t > 1/n, \\ nt & \text{if } -1/n \le t \le 1/n. \end{cases}$$

Illustration of a convergent sequence

Since all these functions are continuous and defined over $T$, $\{x_n\}$ is a sequence in $C[T]$. We can show
that $\{x_n\}$ is a Cauchy sequence: let $n_0$ be an integer and pick $j \ge k \ge n_0$. Then, since $x_j$ and $x_k$ differ only for $|t| < 1/k$, where $|x_j(t) - x_k(t)| \le 1$,
Equation:

$$d_2(x_j, x_k) = \left(\int_{-1}^{1} |x_j(t) - x_k(t)|^2 dt\right)^{1/2} = \left(\int_{-1/k}^{1/k} |x_j(t) - x_k(t)|^2 dt\right)^{1/2} \le \left(\int_{-1/k}^{1/k} 1 \, dt\right)^{1/2} = (2/k)^{1/2} \le \sqrt{2/n_0}.$$

So for a given $\epsilon > 0$, by picking $n_0$ such that $\sqrt{2/n_0} < \epsilon$ (say, for example, $n_0 > \lceil 2/\epsilon^2 \rceil$), we will have that
$d_2(x_j, x_k) < \epsilon$ for all $j, k \ge n_0$; thus, the sequence is Cauchy. Now, we must show that the sequence does
not converge within $C[T]$: we will find a point $x^* \notin C[T]$ such that for $X' = C[T] \cup \{x^*\}$ the sequence
$x_n \to x^*$ in $(X', d_2)$. By inspecting the sequence of signals, we venture the guess
Equation:

$$x^*(t) = \begin{cases} -1 & \text{if } t < 0, \\ 1 & \text{if } t > 0, \\ 0 & \text{if } t = 0, \end{cases}$$

illustrated in the figure below.

Initial guess for a convergence point $x^*$ for the sequence $\{x_i\}$.


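For this signal, we will have
Equation:

$$\begin{aligned} d_2(x_n, x^*) &= \left(\int_{-1}^{1} |x_n(t) - x^*(t)|^2 dt\right)^{1/2} = \left(\int_{-1/n}^{0} |nt - (-1)|^2 dt + \int_{0}^{1/n} |nt - 1|^2 dt\right)^{1/2} \\ &= \left(\int_{-1/n}^{0} |nt + 1|^2 dt + \int_{0}^{1/n} |1 - nt|^2 dt\right)^{1/2} = \left(\frac{1}{3n} + \frac{1}{3n}\right)^{1/2} = \left(\frac{2}{3n}\right)^{1/2}. \end{aligned}$$

So if we select $n_0$ such that $\left(\frac{2}{3 n_0}\right)^{1/2} < \epsilon$, i.e., $n_0 > \frac{2}{3\epsilon^2}$, then we have that $d_2(x_n, x^*) < \epsilon$ for $n \ge n_0$, and
so we have shown that $x_n \to x^*$. Now, since a convergent sequence has a unique limit and $x^* \notin C[T]$, then
$\{x_n\}$ does not converge in $(C[T], d_2)$ and this is not a complete metric space.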


The property of equivalence between Cauchy sequences and convergent sequences often compels us to define
metric spaces that are complete by choosing the metric appropriate to the signal space. For example, by
switching the distance metric to $d_\infty(x, y) = \sup_{t \in T} |x(t) - y(t)|$, the metric space $(C[T], d_\infty)$ becomes
complete.
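
The closed-form distance $d_2(x_n, x^*) = (2/(3n))^{1/2}$ for the ramp sequence of Example 6 can be compared against a numerical integral. The sketch below assumes a midpoint-rule approximation with an arbitrarily chosen number of cells; the helper `d2_distance` is an illustrative name, not part of the course material.

```python
import math

def x_n(n, t):
    """Ramp signal from Example 6: -1 left of -1/n, +1 right of 1/n, n*t in between."""
    if t < -1 / n:
        return -1.0
    if t > 1 / n:
        return 1.0
    return n * t

def x_star(t):
    """Discontinuous candidate limit: the sign of t."""
    if t < 0:
        return -1.0
    if t > 0:
        return 1.0
    return 0.0

def d2_distance(f, g, a=-1.0, b=1.0, cells=20_000):
    """Midpoint-rule approximation of (integral over [a, b] of |f - g|^2 dt)^(1/2)."""
    h = (b - a) / cells
    total = 0.0
    for i in range(cells):
        t = a + (i + 0.5) * h
        total += (f(t) - g(t)) ** 2
    return math.sqrt(total * h)

for n in (1, 4, 16):
    numeric = d2_distance(lambda t: x_n(n, t), x_star)
    closed_form = math.sqrt(2 / (3 * n))
    print(n, round(numeric, 5), round(closed_form, 5))
```

The distances shrink as $n$ grows, consistent with $x_n \to x^*$ under $d_2$.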


Continuity and Convergence of Functions 
Definitions of continuity and convergence for functions defined on metric 
spaces. 


Continuity for Functions 


Definition 1 A function $f : (X, d_x) \to (Y, d_y)$ is continuous at a point
$x_0 \in X$ if: for any $\epsilon > 0$ there exists a $\delta > 0$ such that if $d_x(x_0, x_1) < \delta$,
then $d_y(f(x_0), f(x_1)) < \epsilon$.

Definition 2 A function $f : (X, d_x) \to (Y, d_y)$ is uniformly continuous if:
for every $\epsilon > 0$ there exists a $\delta > 0$ such that for all $x_0 \in X$: if
$d_x(x_0, x_1) < \delta$, then $d_y(f(x_0), f(x_1)) < \epsilon$.

The difference between these definitions is that for a function to be simply
continuous at a point, one need only find a $\delta$ for the given input $x_0$, while
for a function to be uniformly continuous, one needs to find for a given $\epsilon$ a
single value of $\delta$ that works in the definition for every point over which the
function is defined.


Example 1 Consider the function $f : (C[T], d_\infty) \to (\mathbb{R}, d_0)$ defined as
$f(x) = x(t_0)$. In words, the input to $f$ is a continuous function over $T$,
and the output from $f$ is the value of the input function evaluated at $t = t_0$.
We ask the question: Is $f$ continuous at some function input $x_1(t)$?

• Designate a pair of functions $x_1(t), x_2(t)$ such that
$d_\infty(x_1(t), x_2(t)) < \delta$, where
Equation:

$$d_\infty(x_1(t), x_2(t)) = \sup_{t \in T} |x_1(t) - x_2(t)|.$$

In other words, $\sup_{t \in T} |x_1(t) - x_2(t)| < \delta$.

• Next, we see that $d_0(f(x_1), f(x_2)) = |x_1(t_0) - x_2(t_0)|$.

• Because $|x_1(t_0) - x_2(t_0)| \le \sup_{t \in T} |x_1(t) - x_2(t)|$, by the
definition of the supremum, we can simply select $\delta = \epsilon$ to get that if
$d_\infty(x_1(t), x_2(t)) < \delta$, then
Equation:

$$d_0(f(x_1), f(x_2)) \le d_\infty(x_1(t), x_2(t)) < \delta = \epsilon.$$

This shows the continuity of $f(x)$ at $x_1(t)$. However, because the selection
of $\delta$ did not depend on the value of $x_1$, we have also shown that the function
$f$ is uniformly continuous.


Convergence of Functions 


Definition 3 The sequence of functions $\{f_n\}$, $f_n : (X, d_x) \to (Y, d_y)$,
converges pointwise to $f : (X, d_x) \to (Y, d_y)$ if: for each $x \in X$, the
sequence of values $\{f_n(x)\}$ converges to $f(x)$ in $(Y, d_y)$.

Definition 4 The sequence of functions $\{f_n\}$, $f_n : (X, d_x) \to (Y, d_y)$,
converges uniformly to $f : (X, d_x) \to (Y, d_y)$ if: for each $\epsilon > 0$, there
exists $n_0 \in \mathbb{Z}^+$ such that if $n \ge n_0$, then $d_y(f(x), f_n(x)) < \epsilon$ for all
$x \in X$.

The difference between these definitions is that for uniform convergence,
there must exist a single value of $n_0$ that works in the definition of
convergence for all possible values of $x \in X$.


Example 2 Consider the sequence of functions
$x_n(t) : ([0, 1], d_0) \to ([0, 1], d_0)$ given by $x_n(t) = \frac{t}{n}$. One naturally
suspects that $x_n(t)$ may be converging to the zero-valued function. We can
show this formally as follows:

• Pick some $t_0$, and check if $\{x_1(t_0), x_2(t_0), x_3(t_0), x_4(t_0), \ldots\}$ is
converging to 0: Denote $a_n = x_n(t_0) = \frac{t_0}{n}$, and notice that
$d_0(a_n, 0) = |a_n - 0| = \frac{t_0}{n}$. So, to pick an $n_0$ such that $a_n < \epsilon$ if
$n \ge n_0$, one only needs to note that since $\frac{t_0}{n} \le \frac{t_0}{n_0}$, it then suffices for
$\frac{t_0}{n_0} < \epsilon$, or in other words $n_0 > \frac{t_0}{\epsilon}$. So this sequence converges
pointwise to 0.

• Additionally, because the range of possible inputs to $x_n(t)$ is $t \in [0, 1]$,
we could select $n_0 > \frac{1}{\epsilon}$. Because 1 is the maximum value for $t_0$,
$n_0 > \frac{1}{\epsilon} \ge \frac{t_0}{\epsilon}$ will work for all values of $t_0 \in [0, 1]$, and so the
sequence $\{x_n\}$ also converges uniformly to the zero function.
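
The uniform choice $n_0 > 1/\epsilon$ can be checked on a grid of inputs. This is a rough sketch under assumptions: the supremum over $[0, 1]$ is approximated by a maximum over a finite grid that includes the endpoint $t = 1$, and the function names are illustrative.

```python
import math

def x_n(n, t):
    """The n-th function of the sequence, x_n(t) = t / n on [0, 1]."""
    return t / n

def sup_distance_to_zero(n, grid=1001):
    """Grid approximation of sup over t in [0, 1] of |x_n(t) - 0|."""
    return max(abs(x_n(n, k / (grid - 1)) - 0.0) for k in range(grid))

def uniform_n0(eps):
    """Smallest integer strictly greater than 1/eps; it does not depend on t."""
    return math.floor(1 / eps) + 1

for eps in (0.25, 0.1, 0.01):
    n0 = uniform_n0(eps)
    ok = all(sup_distance_to_zero(n) < eps for n in range(n0, n0 + 50))
    print(eps, n0, ok)
```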


Vector Spaces 
Description of vector spaces, subspaces, bases and spans. 


Vector Spaces 


Definition 1 A linear vector space $(X, R, +, \cdot)$ is given by a signal space $X$ (whose elements are called
vectors), a set of scalars $R$, an addition operation $+ : X \times X \to X$, and a multiplication
operation $\cdot : R \times X \to X$, such that:

1. $X$ forms a group under addition:
   a. $\forall x, y \in X$, $x + y \in X$, (closed under addition)
   b. $\exists\, 0 \in X$ such that $0 + x = x + 0 = x$ for all $x \in X$, (additive identity)
   c. $\forall x \in X$ $\exists\, y \in X$ such that $x + y = 0$, (additive inverse)
   d. $\forall x, y, z \in X$, $x + (y + z) = (x + y) + z$. (associative law)

2. Multiplication has the following properties: for any $x, y \in X$ and $a, b \in R$:
   a. $a \cdot x \in X$, (closure in $X$ under multiplication)
   b. $a \cdot (b \cdot x) = (a \cdot b) \cdot x$, (compatibility)
   c. $(a + b) \cdot x = a \cdot x + b \cdot x$, (distributive law over $R$)
   d. $a \cdot (x + y) = a \cdot x + a \cdot y$. (distributive law over $X$)

3. The set $R$ has the following properties:
   a. There exists $1 \in R$ s.t. $1 \cdot x = x$ $\forall x \in X$, (multiplication identity)
   b. There exists $0 \in R$ s.t. $0 \cdot x = 0$ $\forall x \in X$. (multiplicative null element)

Example 1 Here are some examples of vector spaces: 


• $X = \mathbb{R}^n$ (space of all vectors of length $n$) over $R = \mathbb{R}$ is a vector space.
• $X = \mathbb{C}^n$ ($\mathbb{C}$ is the set of complex numbers) over $R = \mathbb{C}$ is a vector space.
• $X = \mathbb{R}^n$ over $R = \mathbb{C}$ is not a vector space, because closure in $X$ under multiplication is not met.
• $X = C[T]$ (continuous functions in $T$) over $R = \mathbb{R}$ is a vector space.


Subspaces 


Definition 2 A subset $M \subseteq X$ is a linear subspace of $X$ if $M$ itself is a linear vector
space. Note that, in particular, this implies that any subspace $M$ must obey $0 \in M$.

Example 2 Here are some examples of subspaces:

• In $X = \mathbb{R}^2$ over $R = \mathbb{R}$, any line that passes through the origin is a subspace of $X$:
Equation:

$$M_1 = \left\{ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \in \mathbb{R}^2 \text{ such that } x_1 = c\, x_2 \right\}.$$

• In $X = C[T]$ over $R = \mathbb{R}$, the following are subspaces of $X$:
Equation:

$$M_1 = \{f(x) = a x^2 + b x + c : a, b, c \in \mathbb{R}\},$$
$$M_2 = \{f(x) : f(x_0) = 0\}.$$

In contrast, the set $M_3 = \{f(x) : f(x_0) = a \neq 0\}$ is not a subspace.

Proposition 1 If $M$ and $N$ are subspaces of $X$, then $M \cap N$ is also a subspace.


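Proof: We assume that $M$ and $N$ hold the properties of a linear vector space, and show that so
does $M \cap N$:

1. Closure under addition. If $x, y \in M \cap N$, then $x, y \in M \Rightarrow x + y \in M$ and
$x, y \in N \Rightarrow x + y \in N$; therefore $x + y \in M \cap N$.

2. Additive identity. Since $M$ is a linear vector space, $0 \in M$; since $N$ is a linear vector space,
$0 \in N$; therefore $0 \in M \cap N$.

3. Additive inverse. If $x \in M \cap N$, then $x \in M \Rightarrow \exists\, y \in M$ s.t. $x + y = 0$ and
$x \in N \Rightarrow \exists\, y \in N$ s.t. $x + y = 0$. Since the additive inverse of $x$ in $X$ is unique, the same
$y$ lies in both $M$ and $N$, and so $y \in M \cap N$.

The other properties are shown in a similar fashion.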


Definition 3 A vector $x \in X$, where $(X, R, +, \cdot)$ is a vector space, is a linear
combination of a set $\{x_1, x_2, \ldots, x_n\} \subseteq X$ if it can be written as $x = \sum_{i=1}^n a_i \cdot x_i$,
$a_i \in R$. The set of all linear combinations of a set of points $\{x_1, x_2, \ldots, x_n\}$ forms a
linear subspace of $X$.

Example 3 $Q = \left\{\sum_{i=0}^2 a_i x^i\right\}$ is a linear subspace of $(C[T], \mathbb{R}, +, \cdot)$ containing the set of
all quadratic functions, as it corresponds to all linear combinations of the set of functions
$\{x^2, x, 1\}$.


Bases and Spans 


Definition 4 For the set $S = \{x_1, x_2, \ldots, x_n\} \subseteq X$, the span of $S$ is written as
Equation:

$$[S] = \mathrm{span}(S) = \left\{x : x = \sum_{i=1}^n a_i x_i, \ a_i \in R\right\}.$$

Example 4 The space of quadratic functions $Q$ is written as $Q = [S_1]$, with
$S_1 = \{x^2, x, 1\}$. The space can also be written as $[S_2]$ with $S_2 = \{1, x, x^2 - 2\}$, i.e.,
$[S_1] = [S_2]$. To prove this, we need to show $[S_2] \subseteq [S_1]$ and $[S_1] \subseteq [S_2]$. For the former
case, an arbitrary element $f \in [S_2]$ can be written as
Equation:

$$f(x) = a_1 + a_2 x + a_3 (x^2 - 2) = (a_1 - 2 a_3) + a_2 x + a_3 x^2,$$

which means that every element that can be spanned by $S_2$ can also be spanned by $S_1$,
and hence $[S_2] \subseteq [S_1]$. The latter case can be shown in a similar manner.
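
The change of coefficients used in Example 4 can be written out in both directions. The sketch below is illustrative only: the function names and the coefficient ordering (tuples ordered as the sets $S_1 = \{x^2, x, 1\}$ and $S_2 = \{1, x, x^2 - 2\}$ are listed) are assumptions made here.

```python
def s2_to_s1(a1, a2, a3):
    """Map a1*1 + a2*x + a3*(x^2 - 2) to coefficients of (x^2, x, 1)."""
    return (a3, a2, a1 - 2 * a3)

def s1_to_s2(b1, b2, b3):
    """Map b1*x^2 + b2*x + b3*1 to coefficients of (1, x, x^2 - 2)."""
    return (b3 + 2 * b1, b2, b1)

def eval_in_s1(b, x):
    """Evaluate a quadratic given its coefficients with respect to S1."""
    b1, b2, b3 = b
    return b1 * x ** 2 + b2 * x + b3

def eval_in_s2(a, x):
    """Evaluate a quadratic given its coefficients with respect to S2."""
    a1, a2, a3 = a
    return a1 + a2 * x + a3 * (x ** 2 - 2)

a = (3, -1, 2)            # element of [S2]
b = s2_to_s1(*a)          # same function expressed in S1
print(b, s1_to_s2(*b))    # the round trip recovers a
print([eval_in_s2(a, x) == eval_in_s1(b, x) for x in (-2, 0, 1, 5)])
```

The second map is the computation behind the "latter case": $b_1 x^2 + b_2 x + b_3 = (b_3 + 2 b_1) + b_2 x + b_1 (x^2 - 2)$.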


Definition 5 A set $S = \{x_1, x_2, \ldots, x_n\}$ is a linearly independent set if
Equation:

$$\sum_{i=1}^n a_i x_i = 0 \Leftrightarrow a_i = 0, \ \forall i \in \{1, 2, \ldots, n\}.$$

Otherwise, the set $S$ is linearly dependent.

Definition 6 A finite set $S$ of linearly independent vectors is a basis for the space $X$ if
$[S] = X$, i.e., if $X$ is spanned by $S$.

Definition 7 The dimension of $X$ is the number of elements of its basis $|S|$. A vector
space for which a finite basis does not exist is called an infinite-dimensional space.


Theorem 1 Any two bases of a subspace have the same number of elements.

Proof: We prove by contradiction: assume that $S_1 = \{x_1, \ldots, x_n\}$ and $S_2 = \{y_1, \ldots, y_m\}$,
$m > n$, are two bases for a subspace $X$ with different numbers of elements. We have
that since $y_1 \in X$ it can be written as a linear combination of $S_1$:
Equation:

$$y_1 = \sum_{i=1}^n a_i x_i.$$

Order the elements of $S_1$ above so that $a_1$ is nonzero; since $y_1$ must be nonzero then at
least one such $a_i$ must exist. Solving the above equation for $x_1$ yields
Equation:

$$x_1 = \frac{1}{a_1} \left( y_1 - \sum_{i=2}^n a_i x_i \right).$$

Thus $\{y_1, x_2, x_3, \ldots, x_n\}$ is a basis, in terms of which we can write any vector of the
space $X$, including $y_2$:
Equation:

$$y_2 = b_1 y_1 + \sum_{i=2}^n b_i x_i.$$

Since $y_1, y_2$ are linearly independent, at least one of the values of $b_i$, $i \ge 2$, must be
nonzero. Sort the remaining $x_i$ so that $b_2$ is nonzero. Solving for $x_2$ results in
Equation:

$$x_2 = \frac{1}{b_2} y_2 - \frac{b_1}{b_2} y_1 - \sum_{i=3}^n \frac{b_i}{b_2} x_i.$$

Therefore, $\{y_1, y_2, x_3, \ldots, x_n\}$ is a basis for $X$. Continuing in this way, we can eliminate
each $x_i$, showing that $\{y_1, y_2, \ldots, y_n\}$ is a basis for $X$. Thus, we have $y_{n+1} = \sum_{i=1}^n c_i y_i$,
or equivalently:
Equation:

$$c_{n+1} y_{n+1} + \sum_{i=1}^n c_i y_i = 0 \quad \text{with } c_{n+1} = -1.$$

As a result, $S_2$ is linearly dependent and is not a basis. Therefore, all bases of $X$ must
have the same number of elements.


Basis Representations 


Having a basis in hand for a given subspace allows us to express the points in the
subspace in more than one way. For each point $x \in [S]$ in the span of a basis
$S = \{s_1, \ldots, s_n\}$, that is,
Equation:

$$x = \sum_{i=1}^n a_i s_i,$$

there is a one-to-one map (i.e., an equivalence) between $x \in [S]$ and
$a = \{a_1, \ldots, a_n\} \in R^n$, that is, both $x$ and $a$ uniquely identify the point in $[S]$. This is
stated more formally as a theorem.

Theorem 2 If $S$ is a linearly independent set, then
Equation:

$$\sum_{i=1}^n a_i s_i = \sum_{i=1}^n b_i s_i$$

if and only if $a_i = b_i$ for $i = 1, 2, \ldots, n$.

Proof: The theorem states that the scalars $\{a_1, \ldots, a_n\}$ are unique for $x$. We begin by
assuming that indeed
Equation:

$$\sum_{i=1}^n a_i s_i = \sum_{i=1}^n b_i s_i.$$

This implies
Equation:

$$\sum_{i=1}^n a_i s_i - \sum_{i=1}^n b_i s_i = 0,$$
$$\sum_{i=1}^n (a_i - b_i) s_i = 0.$$

Since the elements of $S$ are linearly independent, each one of the scalars of the sum must
be zero, that is, $a_i - b_i = 0$ and so $a_i = b_i$ for each $i = 1, \ldots, n$. The converse is immediate: if $a_i = b_i$ for each $i$, the two sums are equal term by term.


Example 5 (Digital Communications) A transmitter sends two waveforms:
Equation:

$$S_1(t) = \sqrt{2/T} \cos(2 \pi f_c t), \quad t \in [0, T], \quad \text{if bit 1 is transmitted},$$

Equation:

$$S_0(t) = \sqrt{2/T} \sin(2 \pi f_c t), \quad t \in [0, T], \quad \text{if bit 0 is transmitted}.$$

The signal $r(t)$ recorded by the receiver is continuous, that is, $r(t) \in C[T]$. Assuming
that the propagation delay is known and corrected at the receiver, we have that the
received signal must be in the span of the possible transmitted signals, i.e.,
$r(t) \in \mathrm{span}(S_1(t), S_0(t))$. One can check that $S_1(t)$ and $S_0(t)$ are linearly
independent. Thus, one can use a unique choice of coefficients $a_0$ and $a_1$ that denote
whether bit 0 or bit 1 is transmitted and contain the amount of attenuation caused by the
transmission:
Equation:

$$r(t) = a_1 S_1(t) + a_0 S_0(t).$$

The uniqueness of this representation can only be obtained if the transmitted signals
$S_0(t)$ and $S_1(t)$ are linearly independent. The waveforms above are used in phase
shift keying (PSK); other similar examples include frequency shift keying (FSK) and
quadrature amplitude modulation (QAM).
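
Because $S_1(t)$ and $S_0(t)$ are linearly independent, the coefficients $(a_1, a_0)$ of a noise-free received signal can be recovered by evaluating $r(t)$ at two time instants and solving a $2 \times 2$ linear system. The sketch below is an illustration only: the symbol duration, carrier frequency, attenuation value, and sample times are arbitrary assumptions, and a practical receiver would not work from two samples.

```python
import math

T = 1.0e-3     # assumed symbol duration in seconds (illustrative value)
F_C = 5.0e3    # assumed carrier frequency in hertz (illustrative value)

def s1(t):
    """Waveform sent for bit 1."""
    return math.sqrt(2 / T) * math.cos(2 * math.pi * F_C * t)

def s0(t):
    """Waveform sent for bit 0."""
    return math.sqrt(2 / T) * math.sin(2 * math.pi * F_C * t)

def coefficients_from_samples(r, t_a, t_b):
    """Solve r(t) = a1*s1(t) + a0*s0(t) at t_a and t_b for (a1, a0) by Cramer's rule."""
    det = s1(t_a) * s0(t_b) - s0(t_a) * s1(t_b)
    if abs(det) < 1e-9:
        raise ValueError("sample times give a singular system; pick other instants")
    a1 = (r(t_a) * s0(t_b) - s0(t_a) * r(t_b)) / det
    a0 = (s1(t_a) * r(t_b) - r(t_a) * s1(t_b)) / det
    return a1, a0

def received(t):
    """Noise-free received signal: bit 1 with an assumed attenuation of 0.3."""
    return 0.3 * s1(t)

a1, a0 = coefficients_from_samples(received, 0.0, 1 / (4 * F_C))
print(round(a1, 6), round(a0, 6))
```

A dominant $a_1$ indicates that bit 1 was sent, and its magnitude reflects the attenuation.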


Normed Spaces 
Description of norms, normed spaces, and Banach spaces 


Distances and metrics allow us to evaluate how different two signals are
from each other. Norms allow us to evaluate how “big”, “important”, or
“interesting” a given signal is.


Definition 1 Assume $X$ is a linear vector space. A norm on $X$ is a function
$\|\cdot\| : X \to \mathbb{R}$ with the following properties for all $x, y \in X$ and $a \in R$:

1. $\|x\| \ge 0$, (non-negativity)
2. $\|x\| = 0$ if and only if $x = 0$, (zero norm for zero vector)
3. $\|a \cdot x\| = |a| \cdot \|x\|$, (scaling)
4. $\|x + y\| \le \|x\| + \|y\|$. (triangle inequality)

$\|x\|$ is read as the norm of $x$ or the length of $x$.

Intuitively, one can say that $\|x\|$ is the distance between $x$ and the zero
vector (more on this soon).

Definition 2 A vector space $X$ with a norm $\|\cdot\|$ is called a normed linear
vector space $(X, \|\cdot\|)$ (or a normed space for brevity).

Definition 3 Let $(X, \|\cdot\|)$ be a normed space. The induced metric or
induced distance is given by $d_i(x, y) = \|x - y\|$.


Definition 4 If a normed space is complete under the induced metric, then it 
is called a Banach space. 


All norms induce distances, but not all distances are induced by norms. 


Example 1 Consider the distance
Equation:

$$d'(x, y) = \begin{cases} 0 & \text{if } x = y, \\ 1 & \text{if } x \neq y. \end{cases}$$

Let us assume that there exists a norm $\|x\|' = d'(x, 0)$ that would induce
this distance. We would then have for $x \neq 0$ and $a \notin \{-1, 0, 1\}$ that
$\|ax\|' = d'(ax, 0) = 1$ and $\|x\|' = d'(x, 0) = 1$, which contradicts
$\|ax\|' = |a| \, \|x\|'$. Thus $\|\cdot\|'$ is not a valid norm.
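
The failure of the scaling property in this example can be seen with a single pair of numbers. The following sketch uses illustrative function names introduced here and contrasts the candidate built from the distance above with the absolute value, which does satisfy the scaling property.

```python
def d_prime(x, y):
    """Distance that is 0 for equal points and 1 otherwise."""
    return 0.0 if x == y else 1.0

def candidate_norm(x):
    """Candidate 'norm' obtained as the distance to the zero vector."""
    return d_prime(x, 0.0)

def violates_scaling(norm, x, a, tol=1e-12):
    """True if norm(a * x) differs from |a| * norm(x)."""
    return abs(norm(a * x) - abs(a) * norm(x)) > tol

print(violates_scaling(candidate_norm, 3.0, 2.0))   # scaling fails for a = 2
print(violates_scaling(candidate_norm, 3.0, -1.0))  # no violation for a in {-1, 0, 1}
print(violates_scaling(abs, 3.0, 2.0))              # the absolute value scales correctly
```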


In contrast, here are some examples of valid norms. 


Example 2 The vector space $X = C[T]$ accepts the norm
$\|x\|_\infty = \sup_{t \in T} |x(t)|$. The induced distance is
$d_i(x, y) = \sup_{t \in T} |x(t) - y(t)| = d_\infty(x, y)$; it is straightforward to
prove properties (1-4). We previously noted that the metric space
$(C[T], d_\infty)$ is complete, and so $(C[T], \|\cdot\|_\infty)$ is a Banach space.

Example 3 The vector space $X = \mathbb{R}^n$ accepts the norm
$\|x\|_2 = \left(\sum_{i=1}^n |x_i|^2\right)^{1/2}$. The induced metric is
$d_i(x, y) = \|x - y\|_2 = \left(\sum_{i=1}^n |x_i - y_i|^2\right)^{1/2} = d_2(x, y)$, the
Euclidean distance. Thus $\|\cdot\|_2$ is known as the Euclidean norm. The
spaces $(\mathbb{R}^n, \|\cdot\|_2)$ are Banach spaces for all values of $n$.

Example 4 The space $X = (0, 1]$ accepts the norm $\|x\| = |x|$. The
induced metric $d_i(x, y) = |x - y| = d_0(x, y)$ is the standard metric for
the reals. Since we have previously shown that $(X, d_0)$ is not a complete
metric space, the space $((0, 1], \|\cdot\|)$ is not a Banach space.


Topological Concepts 
Review of basic topological sets 


Interiors and Closures 


Definition 1 Let $(X, d)$ be a metric space. The ball $B(x_0, \epsilon)$ centered at
$x_0 \in X$ and of radius $\epsilon > 0$ is defined as
$B(x_0, \epsilon) = \{x \in X : d(x, x_0) < \epsilon\}$.

If $(X, \|\cdot\|)$ is a normed space, then $B(x_0, \epsilon) = \{x \in X : \|x - x_0\| < \epsilon\}$.

Example 1 In the metric space $(\mathbb{R}^2, d_2)$, the ball $B(p_0, \epsilon)$ is a disc (the interior of a circle)
centered at $p_0$ and of radius $\epsilon$, as illustrated in the figure below.

A Euclidean/$\ell_2$ ball with center $p_0$ and radius $\epsilon$


The following four definitions are fundamental in the study of topology. 


Definition 2 Let $P \subseteq X$ where $(X, d)$ is a metric space. A vector $p_0 \in P$ is
an interior point of $P$ if there exists $\epsilon > 0$ such that $B(p_0, \epsilon) \subseteq P$.

Intuitively, the notion of an interior point is a point that is not in the
“boundary” of the set, as a ball around it is contained within the set.

Definition 3 The interior of a set $P$ is the collection of all the interior
points of $P$ and is denoted as $P^\circ$.

Intuitively, closure points are points that are arbitrarily close to the set $P$;
note however that a closure point need not be in $P$, but only have a
sequence of elements of $P$ that converge to it.

Definition 4 A point $p_1 \in X$ is a closure point of $P$ if for all $\epsilon > 0$ we have
that $B(p_1, \epsilon) \cap P \neq \emptyset$.

Definition 5 The closure of a set $P$ is the set of all closure points of $P$,
denoted as $\overline{P}$.


Open and Closed Sets 
Topology is the study of open and closed sets, defined below. 


Definition 6 A set $P$ is said to be open if $P = P^\circ$, i.e., every point in $P$ is
an interior point of $P$.

Definition 7 A set $P$ is said to be closed if $P = \overline{P}$, i.e., $P$ contains all its
closure points.

Fact 1 Since all interior points of $P$ are in $P$ and every point in $P$ is a
closure point of $P$, we have that $P^\circ \subseteq P \subseteq \overline{P}$.


Example 2 The following are examples of open and closed sets. 


• The set $[a, b] = \{x \in \mathbb{R} : a \le x \le b\}$ is closed.

• The set $(a, b) = \{x \in \mathbb{R} : a < x < b\}$ is open. To show this, we must
show that every point $x \in (a, b)$ is an interior point. Pick an arbitrary
$x \in (a, b)$, and define $\epsilon = \min\left(\frac{x-a}{2}, \frac{b-x}{2}\right)$. Then the ball
$B(x, \epsilon) = \{u \in \mathbb{R} : |u - x| < \epsilon\}$ can be rewritten as the set of all
points $u$ such that $-\epsilon + x < u < \epsilon + x$. Using the definition of $\epsilon$, we
have that if $u \in B(x, \epsilon)$ then
Equation:
$$x - \frac{x-a}{2} < u < x + \frac{b-x}{2},$$
or equivalently,
Equation:
$$\frac{x+a}{2} < u < \frac{b+x}{2}.$$
Since $a < x < b$, we have
Equation:
$$a < \frac{x+a}{2} < u < \frac{x+b}{2} < b,$$
and so $u \in (a, b)$. Since $u \in B(x, \epsilon)$ was arbitrary, we then have
$B(x, \epsilon) \subseteq (a, b)$ and $x$ is an interior point of $(a, b)$. Now since
$x \in (a, b)$ was arbitrary, the set $(a, b)$ is open.


Properties of Open and Closed Sets 


Theorem 1 (i) If $A$ is open then its complement $A^c$ is closed. (ii) If $A$ is closed then $A^c$ is
open.


Proof: (i) We will prove by contradiction: assume $A$ is open and $A^c$ is not
closed, that is, there exists a closure point $x$ of $A^c$ such that $x \notin A^c$, that
is, $x \in A$. Since $x$ is a closure point of $A^c$, we have that for every $\epsilon > 0$,
Equation:
$$B(x, \epsilon) \cap A^c \neq \emptyset.$$
Since $A$ is open and $x \in A$, then $x$ is an interior point of $A$, which means
that there exists $\epsilon_0 > 0$ such that $B(x, \epsilon_0) \subseteq A$, which means that
$B(x, \epsilon_0) \cap A^c = \emptyset$, a contradiction with [link]. Therefore, we must have
that $A^c$ is closed.


(ii) Assume $A$ is closed, which means that $A$ contains all its closure points.
That means that if $x \in A^c$ then $x$ is not a closure point of $A$, meaning that
for some $\epsilon_0 > 0$ we must have $B(x, \epsilon_0) \cap A = \emptyset$. This means that
$B(x, \epsilon_0) \subseteq A^c$, and so $x$ is an interior point of $A^c$. Since $x$ was an
arbitrary point in $A^c$, this means that $A^c$ is open.


Proposition 1 The intersection of a finite number of open sets is open, and 
the union of an arbitrary collection of open sets is open. 


Proof: We will limit the proof to two sets. The intersection case extends to any
finite number of sets by induction, and the union argument applies unchanged to
an arbitrary collection of sets.


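We first show that if $A_1, A_2$ are open then $A_1 \cap A_2$ is open, i.e.,
$(A_1 \cap A_2)^\circ = A_1 \cap A_2$. Assume $x \in A_1 \cap A_2$; then $x \in A_1$ and $x \in A_2$.
Since $A_1, A_2$ are open, there exist $\epsilon_1, \epsilon_2 > 0$ such that $B(x, \epsilon_1) \subseteq A_1$
and $B(x, \epsilon_2) \subseteq A_2$. Set $\epsilon = \min(\epsilon_1, \epsilon_2)$; then $B(x, \epsilon) \subseteq B(x, \epsilon_1)$ and
$B(x, \epsilon) \subseteq B(x, \epsilon_2)$. By transitivity of inclusion, we have that
$B(x, \epsilon) \subseteq A_1$ and $B(x, \epsilon) \subseteq A_2$. Therefore, $B(x, \epsilon) \subseteq A_1 \cap A_2$.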


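Next, we show that if $A_1, A_2$ are open then $A_1 \cup A_2$ is open, i.e.,
$(A_1 \cup A_2)^\circ = A_1 \cup A_2$. Assume $x \in A_1 \cup A_2$; then $x \in A_1$ or $x \in A_2$. If
$x \in A_1$ then there exists $\epsilon_1 > 0$ s.t. $B(x, \epsilon_1) \subseteq A_1 \subseteq A_1 \cup A_2$. Similarly, if
$x \in A_2$, there exists $\epsilon_2 > 0$ s.t. $B(x, \epsilon_2) \subseteq A_2 \subseteq A_1 \cup A_2$. So
$x \in A_1 \cup A_2$ is an interior point of $A_1 \cup A_2$ and therefore
$(A_1 \cup A_2)^\circ = A_1 \cup A_2$.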


Proposition 2 The union of a finite number of closed sets is closed. The 
intersection of an arbitrary collection of closed sets is closed. 


The following useful properties are proven in “Optimization by Vector 
Space Methods” by David Luenberger, pages 25 and 38. 


Proposition 3 If $C$ is convex then its interior $C^\circ$ and closure $\overline{C}$ are convex.


Proposition 4 A subset of a Banach space is complete if and only if it is 
closed. 


Proposition 5 Any finite dimensional subspace of a normed linear space is 
complete. 


Why are Banach spaces useful? In optimization, we want to show that if an 
increasingly better solution can be found then an optimum must exist. 


Inner Product Spaces and Hilbert Spaces 
Review of inner products and inner product spaces. 


Inner Products 


We have defined distances and norms to measure whether two signals are different from each 
other and to measure the “size” of a signal. However, it is possible for two pairs of signals with 
the same norms and distance to exhibit different behavior - an example of this contrast is to pick 
a pair of orthogonal signals and a pair of non-orthogonal signals, as shown in [link]. 


[Figure with panels "Orthogonal" and "Non-Orthogonal"; the distance d(x1, x2) is marked.]

An example of orthogonality in a two-dimensional
space: distances and norms are not indicative of
orthogonality; two pairs of vectors with the same
distance can have arbitrary angle between them.


To distinguish between orthogonal and non-orthogonal pairs of signals we use the
inner product, which provides us with a new measure of “similarity”.


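Definition 1 An inner product for a vector space $(X, R, +, \cdot)$ is a function $\langle \cdot, \cdot \rangle : X \times X \to R$,
sometimes denoted $\langle \cdot | \cdot \rangle$, with the following properties: for all $x, y, z \in X$ and $a \in R$,


1. $\langle x, y \rangle = \overline{\langle y, x \rangle}$ (complex conjugate property),
2. $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$ (distributive property),
3. $\langle a x, y \rangle = a \langle x, y \rangle$ (scaling property),
4. $\langle x, x \rangle \ge 0$, and $\langle x, x \rangle = 0$ if and only if $x = 0$.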


A vector space with an inner product is called an inner product space or a pre-Hilbert space.
It is worth pointing out that properties (2-3) say that the inner product is linear, albeit only on the
first input. However, if $R = \mathbb{R}$, then properties (2-3) hold for both inputs and the inner
product is linear on both inputs.


Just as every norm induces a distance, every inner product induces a norm: $\|x\|_i = \sqrt{\langle x, x \rangle}$.


Hilbert Spaces 


Definition 2 An inner product space that is complete under the metric induced by the induced 
norm is called a Hilbert space. 


Example 1 The following are examples of inner product spaces:


1. $X = \mathbb{R}^n$ with the inner product $\langle x, y \rangle = \sum_{i=1}^n x_i y_i = y^T x$. The corresponding induced
norm is given by $\|x\|_i = \sqrt{\langle x, x \rangle} = \sqrt{\sum_{i=1}^n x_i^2} = \|x\|_2$, i.e., the $\ell_2$ norm. Since
$(\mathbb{R}^n, \|\cdot\|_2)$ is complete, it is a Hilbert space.
2. $X = C[T]$ with inner product $\langle x, y \rangle = \int_T x(t) y(t)\,dt$. The corresponding induced norm is
$\|x\|_i = \sqrt{\int_T x(t)^2\,dt} = \|x\|_2$, i.e., the $L_2$ norm.
3. If we allow the functions in $C[T]$ to be complex-valued, then the inner product is defined by
$\langle x, y \rangle = \int_T x(t) \overline{y(t)}\,dt$, and the corresponding induced norm is
$\|x\|_i = \sqrt{\int_T x(t) \overline{x(t)}\,dt} = \sqrt{\int_T |x(t)|^2\,dt} = \|x\|_2$.
4. $X = \mathbb{C}^n$ with inner product $\langle x, y \rangle = \sum_{i=1}^n x_i \overline{y_i} = y^H x$; here, $x^H$ denotes the Hermitian
of $x$. The corresponding induced norm is $\|x\|_i = \sqrt{\sum_{i=1}^n |x_i|^2} = \|x\|_2$.

Theorem 1 (Cauchy-Schwarz Inequality) Assume $X$ is an inner product space. For each
$x, y \in X$, we have that $|\langle x, y \rangle| \le \|x\|_i \|y\|_i$, with equality if (i) $y = a x$ for some $a \in R$; (ii)
$x = 0$; or (iii) $y = 0$.


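Proof: We consider two separate cases.


• If $y = 0$ then $\langle x, y \rangle = \overline{\langle y, x \rangle} = \overline{\langle 0 \cdot y, x \rangle} = \overline{0 \langle y, x \rangle} = 0 \langle x, y \rangle = 0 = \|x\|_i \|y\|_i$. The
proof is similar if $x = 0$.
• If $x, y \neq 0$ then $0 \le \langle x - a y, x - a y \rangle = \langle x, x \rangle - a \langle y, x \rangle - \overline{a} \langle x, y \rangle + a \overline{a} \langle y, y \rangle$, with
equality if $x - a y = 0$, i.e., $x = a y$ for some $a \in R$. Now set $a = \frac{\langle x, y \rangle}{\langle y, y \rangle}$, and so $\overline{a} = \frac{\langle y, x \rangle}{\langle y, y \rangle}$.
We then have
Equation:
$$0 \le \langle x, x \rangle - \frac{\langle x, y \rangle}{\langle y, y \rangle} \langle y, x \rangle - \frac{\langle y, x \rangle}{\langle y, y \rangle} \langle x, y \rangle + \frac{\langle x, y \rangle \langle y, x \rangle}{\langle y, y \rangle \langle y, y \rangle} \langle y, y \rangle$$
$$= \langle x, x \rangle - \frac{\langle x, y \rangle \overline{\langle x, y \rangle}}{\langle y, y \rangle} = \|x\|_i^2 - \frac{|\langle x, y \rangle|^2}{\|y\|_i^2}.$$
This implies $\frac{|\langle x, y \rangle|^2}{\|y\|_i^2} \le \|x\|_i^2$, and so, since all quantities involved are nonnegative, we have
$|\langle x, y \rangle| \le \|x\|_i \cdot \|y\|_i$.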
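The inequality can be checked numerically. The sketch below is only an illustration: the helper names `inner` and `norm` and the test vectors are assumptions made for this example, and the inner product on $\mathbb{C}^n$ is taken to be linear in the first argument, as in the definition above.

```python
import math

def inner(x, y):
    """Standard inner product on C^n, linear in the first argument."""
    return sum(xi * complex(yi).conjugate() for xi, yi in zip(x, y))

def norm(x):
    """Norm induced by the inner product."""
    return math.sqrt(inner(x, x).real)

# Two vectors that are not scalar multiples of each other (example data).
x = [1 + 2j, -1j, 3.0]
y = [2.0, 1 - 1j, -1 + 0.5j]

lhs = abs(inner(x, y))
rhs = norm(x) * norm(y)
print(lhs <= rhs)            # the Cauchy-Schwarz inequality holds

# Equality case: y is a scalar multiple of x.
a = 2 - 3j
y_parallel = [a * xi for xi in x]
gap = norm(x) * norm(y_parallel) - abs(inner(x, y_parallel))
print(abs(gap) < 1e-9)       # the two sides agree up to rounding
```

The second check uses a tolerance because floating-point arithmetic only reproduces the equality case approximately.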


Properties of Inner Product Spaces


In the previous lecture we discussed norms induced by inner products but failed to prove that 
they are valid norms. Most properties are easy to check; below, we check the triangle inequality 
for the induced norm. 


Lemma 1 If $\|x\|_i = \sqrt{\langle x, x \rangle}$, then $\|x + y\|_i \le \|x\|_i + \|y\|_i$.


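From the definition of the induced norm,
Equation:
$$\|x + y\|_i^2 = \langle x + y, x + y \rangle$$
$$= \langle x, x \rangle + \langle x, y \rangle + \langle y, x \rangle + \langle y, y \rangle$$
$$= \|x\|_i^2 + \langle x, y \rangle + \overline{\langle x, y \rangle} + \|y\|_i^2$$
$$= \|x\|_i^2 + 2\,\mathrm{real}(\langle x, y \rangle) + \|y\|_i^2.$$


At this point, we can upper bound the real part of the inner product by its magnitude:
$\mathrm{real}(\langle x, y \rangle) \le |\langle x, y \rangle|$. Thus, we obtain
Equation:
$$\|x + y\|_i^2 \le \|x\|_i^2 + 2 |\langle x, y \rangle| + \|y\|_i^2$$
$$\le \|x\|_i^2 + 2 \|x\|_i \|y\|_i + \|y\|_i^2$$
$$= (\|x\|_i + \|y\|_i)^2,$$


where the second inequality is due to the Cauchy-Schwarz inequality. Thus we have shown that
$\|x + y\|_i \le \|x\|_i + \|y\|_i$.


Here's an interesting (and easy to prove) fact about inner products: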


Lemma 2 If $\langle x, y \rangle = 0$ for all $x \in X$ then $y = 0$.


Proof: Pick $x = y$, and so $\langle y, y \rangle = 0$. Due to the properties of an inner product, this implies that
$y = 0$.


Earlier, we considered whether all distances are induced by norms (and found a 
counterexample). We can ask the same question here: are all norms induced by inner products? 
The following theorem helps us check for this property. 


Theorem 2 (Parallelogram Law) If a norm $\|\cdot\|$ is induced by an inner product, then
$\|x + y\|^2 + \|x - y\|^2 = 2\left(\|x\|^2 + \|y\|^2\right)$ for all $x, y \in X$.


This theorem allows us to rule out norms that cannot be induced. 


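Proof: For an induced norm we have $\|x\|^2 = \langle x, x \rangle$. Therefore,
Equation:
$$\|x + y\|^2 + \|x - y\|^2 = \langle x + y, x + y \rangle + \langle x - y, x - y \rangle$$
$$= \langle x, x \rangle + \langle x, y \rangle + \langle y, x \rangle + \langle y, y \rangle + \langle x, x \rangle - \langle x, y \rangle - \langle y, x \rangle + \langle y, y \rangle$$
$$= 2 \langle x, x \rangle + 2 \langle y, y \rangle = 2\left(\|x\|^2 + \|y\|^2\right).$$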


Example 2 Consider the normed space $(C[T], L_\infty)$, and recall that $\|x\|_\infty = \sup_{t \in T} |x(t)|$. If
this norm is induced, then the Parallelogram law would hold. If not, then we can find a
counterexample. In particular, let $T = [0, 2\pi]$, $x(t) = 1$, and $y(t) = \cos(t)$. Then, we want to
check if $\|x + y\|^2 + \|x - y\|^2 = 2\left(\|x\|^2 + \|y\|^2\right)$. We compute:
Equation:
$$\|x\|_\infty = 1,$$
$$\|y\|_\infty = 1,$$
$$\|x + y\|_\infty = \|1 + \cos(t)\|_\infty = \sup_{t \in T} |1 + \cos(t)| = 1 + 1 = 2,$$
$$\|x - y\|_\infty = \|1 - \cos(t)\|_\infty = \sup_{t \in T} |1 - \cos(t)| = 1 - (-1) = 2.$$


Plugging into the two sides of the Parallelogram law,
Equation:
$$2^2 + 2^2 = 8 \quad \text{and} \quad 2(1^2 + 1^2) = 4,$$
$$8 \neq 4,$$


and the Parallelogram law does not hold. Thus, the $L_\infty$ norm is not an induced norm.
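The same comparison can be reproduced on a sampled grid. The following is a minimal sketch, assuming a uniform grid on $[0, 2\pi]$ and a Riemann-sum approximation of the $L_2$ norm; the function names and the grid size are choices made for this illustration only.

```python
import math

N = 10_000
ts = [2 * math.pi * k / N for k in range(N + 1)]   # uniform grid on [0, 2*pi]
x = [1.0 for _ in ts]
y = [math.cos(t) for t in ts]

def sup_norm(f):
    """Sampled version of the L_infinity norm."""
    return max(abs(v) for v in f)

def l2_norm(f):
    """Riemann-sum approximation of the L_2 norm on [0, 2*pi]."""
    dt = 2 * math.pi / N
    return math.sqrt(sum(v * v for v in f[:-1]) * dt)

def parallelogram_gap(norm):
    """||x+y||^2 + ||x-y||^2 - 2(||x||^2 + ||y||^2); zero when the law holds."""
    s = [a + b for a, b in zip(x, y)]
    d = [a - b for a, b in zip(x, y)]
    return norm(s) ** 2 + norm(d) ** 2 - 2 * (norm(x) ** 2 + norm(y) ** 2)

print(parallelogram_gap(sup_norm))   # close to 8 - 4 = 4: the law fails
print(parallelogram_gap(l2_norm))    # close to 0: the law holds
```

The gap vanishes for the sampled $L_2$ norm because that norm is induced by an inner product, while the sup norm leaves the gap of 4 computed in the example.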


The ell_2 space 
Formulation of the ell_2 space 


Definition 1 We define the $\ell_2$ space of infinite sequences of finite energy as
Equation:
$$\ell_2 = \left\{ (x_1, x_2, x_3, \ldots) \in \mathbb{R}^\infty \text{ such that } \sum_{i=1}^\infty |x_i|^2 < \infty \right\}.$$


On this space, define the inner product $\langle x, y \rangle = \sum_{i=1}^\infty x_i y_i$, and obtain the
induced norm $\|x\|_2 = \sqrt{\sum_{i=1}^\infty |x_i|^2}$, which we term the $\ell_2$ norm.


Here $\mathbb{R}^\infty$ refers to the set of all infinite sequences of real values. Note that
membership in $\ell_2$ implies that the sequence $|x_i|$ must converge to zero as $i \to \infty$. Note
also that we can then say that the $\ell_2$ space consists of all infinite sequences with
finite $\ell_2$ norm.
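As a quick numerical illustration of the finite-energy condition, the sketch below compares partial energy sums for two sequences that both converge to zero. The sequences and the helper `energy` are assumptions chosen for this example; it is a standard fact that $\sum 1/i^2$ converges (to $\pi^2/6$) while $\sum 1/i$ diverges.

```python
import math

def energy(term, n_terms):
    """Partial sum of |x_i|^2 for i = 1..n_terms."""
    return sum(abs(term(i)) ** 2 for i in range(1, n_terms + 1))

def harmonic(i):
    return 1.0 / i                # x_i = 1/i: finite energy, so x is in ell_2

def slow(i):
    return 1.0 / math.sqrt(i)     # x_i = 1/sqrt(i): tends to 0 but is not in ell_2

for n in (10, 1_000, 100_000):
    print(n, energy(harmonic, n), energy(slow, n))

print(math.pi ** 2 / 6)           # limit of the first column of partial sums
```

The first column levels off while the second keeps growing, which shows that convergence of $|x_i|$ to zero is necessary for membership in $\ell_2$ but not sufficient.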


Theorem 1 The $\ell_2$ space is a Hilbert space.

To prove this theorem, we need the following lemma.

Lemma 1 A Cauchy sequence in a normed space is bounded.


Proof of Lemma 1: Let $\{x_n\}$ be a Cauchy sequence and let $n_0$ be an integer
such that $\|x_n - x_{n_0}\| < 1$ for $n > n_0$. For $n > n_0$, we have
$\|x_n\| = \|x_n - x_{n_0} + x_{n_0}\| \le \|x_n - x_{n_0}\| + \|x_{n_0}\| \le 1 + \|x_{n_0}\|$.
Now set $M = \max_{1 \le n \le n_0} \|x_n\| + 1$. Then, $\|x_n\| \le M$ for all
$n = 1, 2, \ldots$.

Proof of Theorem 1: We show that $\ell_2$ is complete by proving that all
Cauchy sequences converge in $\ell_2$. Assume that $\{x^{(n)}\}$ is a Cauchy
sequence in $\ell_2$. Then, if $x^{(n)} = (x_1^{(n)}, x_2^{(n)}, \ldots)$, we have that for each $\epsilon > 0$ there exists
some $n_0$ such that if $j, k > n_0$, then for each $i = 1, 2, \ldots$,
Equation:
$$|x_i^{(j)} - x_i^{(k)}| \le \left( \sum_{m=1}^\infty |x_m^{(j)} - x_m^{(k)}|^2 \right)^{1/2} = \|x^{(j)} - x^{(k)}\|_2 \le \epsilon.$$
Therefore, each sequence $\{x_i^{(n)}\}_{n=1}^\infty$, for $i = 1, 2, \ldots$, is a Cauchy
sequence in the space $(\mathbb{C}, d_0)$, which is complete. Thus, each sequence
$\{x_i^{(n)}\}$ must converge in $\mathbb{C}$ to some value $x_i$.


Define $x = (x_1, x_2, \ldots)$. We show that $x \in \ell_2$. According to Lemma 1,
the sequence $\{x^{(n)}\}$ is bounded by some constant $M$. For each pair
$n, k = 1, 2, \ldots$, we have
Equation:
$$\sum_{i=1}^k |x_i^{(n)}|^2 \le \|x^{(n)}\|_2^2 \le M^2.$$


This inequality is valid for each value of $n$, and so, letting $n \to \infty$, we must have
Equation:
$$\sum_{i=1}^k |x_i|^2 \le M^2.$$


Additionally, this inequality is valid for each value of $k$, and so we must
have
Equation:
$$\sum_{i=1}^\infty |x_i|^2 \le M^2.$$


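Thus, we have shown that $x \in \ell_2$. The last point we need to show is that
$x^{(n)} \to x$. First, since the sequence $\{x^{(n)}\}$ is Cauchy, for each $\epsilon > 0$ there
exists an $n_0$ such that if $j, k > n_0$, then for each $l = 1, 2, \ldots$ we have
Equation:
$$\sum_{i=1}^l |x_i^{(j)} - x_i^{(k)}|^2 \le \|x^{(j)} - x^{(k)}\|_2^2 \le \frac{\epsilon}{4}.$$


Observe also that for each $i$ there exists $n_{0,i}$ such that if $k > n_{0,i}$, then
$|x_i^{(k)} - x_i|^2 \le \frac{\epsilon}{4l}$; therefore, if $j > n_0$ and $k > \max(n_0, \max_{1 \le i \le l} n_{0,i})$, we have
Equation:
$$\sum_{i=1}^l |x_i^{(j)} - x_i|^2 \le 2 \sum_{i=1}^l |x_i^{(j)} - x_i^{(k)}|^2 + 2 \sum_{i=1}^l |x_i^{(k)} - x_i|^2 \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon,$$


where the first inequality comes from the fact that $|a + b|^2 \le 2(|a|^2 + |b|^2)$,
which is easy to check. The left-hand side does not depend on $k$, so the bound
holds for every $j > n_0$. Since this is true for each $l = 1, 2, \ldots$, it
follows that if $j > n_0$, then
Equation:
$$\|x^{(j)} - x\|_2^2 = \sum_{i=1}^\infty |x_i^{(j)} - x_i|^2 \le \epsilon.$$


This shows that $x^{(n)} \to x$.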


Orthogonality 
Definition of orthogonal vectors, sets, and subspaces; properties and benefits. 


Recall that a set $S$ is a basis for a subspace $X$ if $[S] = X$ and $S$ has linearly
independent elements. If $S$ is a basis for $X$ then each $x \in X$ is uniquely
determined by $(a_1, a_2, \ldots, a_n)$ such that $\sum_{i=1}^n a_i s_i = x$. In this sense, we could
operate either with $x$ itself or with the vector $a = [a_1, \ldots, a_n] \in \mathbb{R}^n$. One would
wonder then whether particular operations can be performed with a representation
$a$ instead of the original vector $x$.


Example 1 Assume $x, y \in X$ have representations $a, b \in \mathbb{R}^n$ in a basis for $X$.
Can we say that $\langle x, y \rangle = \langle a, b \rangle$?


For the particular example of $X = L_2[0, 1]$, let $S = \{1, t, t^2\}$ so that $[S] = Q$, the
set of all quadratic functions supported on $[0, 1]$. Pick $x = 2 + t + t^2$ and
$y = 1 + 2t + 3t^2$. One can see then that if we label $s_1 = 1$, $s_2 = t$, $s_3 = t^2$, then
the coefficient vectors for $x$ and $y$ are $a = [2\ 1\ 1]$ and $b = [1\ 2\ 3]$, respectively.
Let us compute both inner products:
Equation:
$$\langle x, y \rangle = \int_0^1 x(t) y(t)\,dt = \int_0^1 (2 + t + t^2)(1 + 2t + 3t^2)\,dt = \frac{187}{20} = 9.35,$$
$$\langle a, b \rangle = 2 + 2 + 3 = 7.$$
Since $7 \neq 9.35$, we find that we fail to obtain the desired equivalence between
vectors and their representations.
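The two values can be verified with exact arithmetic. This is a small sketch in which polynomials are stored as coefficient lists `[c0, c1, c2]`; the helper names `poly_mul` and `inner_L2` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists [c0, c1, ...]."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += Fraction(pi) * Fraction(qj)
    return out

def inner_L2(p, q):
    """Integral over [0, 1] of p(t) q(t) dt, computed term by term."""
    return sum(c / (k + 1) for k, c in enumerate(poly_mul(p, q)))

a = [2, 1, 1]   # x(t) = 2 + t + t^2 in the basis {1, t, t^2}
b = [1, 2, 3]   # y(t) = 1 + 2t + 3t^2

print(inner_L2(a, b))                        # 187/20
print(sum(ai * bi for ai, bi in zip(a, b)))  # 7
```

The mismatch appears because the basis $\{1, t, t^2\}$ is not orthonormal in $L_2[0, 1]$.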


While this example was unsuccessful, simple conditions on the basis S will yield 
this desired equivalence, plus many more useful properties. 


Several definitions of orthogonality will be useful to us during the course. 


Definition 1 A pair of vectors $x$ and $y$ in an inner product space are orthogonal
(denoted $x \perp y$) if the inner product $\langle x, y \rangle = 0$.


Note that 0 is immediately orthogonal to all vectors.


Definition 2 Let $X$ be an inner product space. A set of vectors $S \subseteq X$ is
orthogonal if $\langle x, y \rangle = 0$ for all $x, y \in S$, $x \neq y$.


Definition 3 Let $X$ be an inner product space. A set of vectors $S \subseteq X$ is
orthonormal if $S$ is an orthogonal set and $\|s\| = \sqrt{\langle s, s \rangle} = 1$ for all $s \in S$.


Definition 4 A vector $x$ in an inner product space $X$ is orthogonal to a set $S \subseteq X$
(denoted $x \perp S$) if $x \perp y$ for all $y \in S$.


Definition 5 Let $X$ be an inner product space. Two sets $S_1 \subseteq X$ and $S_2 \subseteq X$ are
orthogonal (denoted $S_1 \perp S_2$) if $x \perp y$ for all $x \in S_1$ and $y \in S_2$.


Definition 6 The orthogonal complement $S^\perp$ of a set $S$ is the set of all vectors
that are orthogonal to $S$.


Benefits of Orthogonality 


Why is orthonormality good? For many reasons. One of them is the equivalence
of inner products that we desired in a previous example. Another is that having an
orthonormal basis allows us to easily find the coefficients $a_1, \ldots, a_n$ of $x$ in a basis
$S$.


Example 2 Let $x \in X$ and $S$ be a basis for $X$ (i.e., $[S] = X$). We wish to find
$a_1, \ldots, a_n$ such that $x = \sum_{i=1}^n a_i s_i$. Consider the inner products
Equation:
$$\langle x, s_i \rangle = \left\langle \sum_{j=1}^n a_j s_j, s_i \right\rangle = \sum_{j=1}^n a_j \langle s_j, s_i \rangle,$$
due to the linearity of the inner product in the first term. If $S$ is orthonormal, then
we have that for $i \neq j$, $\langle s_i, s_j \rangle = 0$. In that case the sum above becomes
Equation:
$$\langle x, s_i \rangle = a_i \langle s_i, s_i \rangle = a_i \|s_i\|^2 = a_i,$$
due to the orthonormality of $S$. In other words, for an orthonormal basis $S$ one can
find the basis coefficients as $a_i = \langle x, s_i \rangle$.

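If $S$ is not orthonormal, then we can rewrite the sum above as the product of a row
vector and a column vector as follows:
Equation:
$$\langle x, s_i \rangle = \begin{bmatrix} \langle s_1, s_i \rangle & \langle s_2, s_i \rangle & \cdots & \langle s_n, s_i \rangle \end{bmatrix} \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}.$$


We can then stack these equations for $i = 1, \ldots, n$ to obtain the following matrix-vector
multiplication:
Equation:
$$\begin{bmatrix} \langle x, s_1 \rangle \\ \langle x, s_2 \rangle \\ \vdots \\ \langle x, s_n \rangle \end{bmatrix} = \begin{bmatrix} \langle s_1, s_1 \rangle & \langle s_2, s_1 \rangle & \cdots & \langle s_n, s_1 \rangle \\ \langle s_1, s_2 \rangle & \langle s_2, s_2 \rangle & \cdots & \langle s_n, s_2 \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle s_1, s_n \rangle & \langle s_2, s_n \rangle & \cdots & \langle s_n, s_n \rangle \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}.$$


The nomenclature given above provides us with the matrix equation $\beta = G \cdot a$,
where $\beta$ and $G$ have entries $\beta_i = \langle x, s_i \rangle$ and $G_{i,j} = \langle s_j, s_i \rangle$, respectively.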


Definition 7 The matrix G above is called the Gram matrix (or Gramian) of the 
set S. 


In the particular case of orthonormal $S$, it is easy to see that $G = I$, the identity
matrix, and so $a = \beta$ as given earlier. For invertible Gramians $G$, one could
compute the coefficients in vector form as $a = G^{-1} \beta$. For square matrices (like $G$),
invertibility is linked to singularity.
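To make the relation $a = G^{-1}\beta$ concrete, the sketch below recovers the coefficients of $x(t) = 2 + t + t^2$ in the non-orthonormal basis $\{1, t, t^2\}$ of $L_2[0, 1]$. The Gauss-Jordan helper `solve` is a hypothetical routine written for this illustration (it assumes $G$ is non-singular); the Gram entries use the closed form $\langle t^j, t^i \rangle = 1/(i + j + 1)$ for exponents $i, j \ge 0$.

```python
from fractions import Fraction

def solve(G, beta):
    """Solve G a = beta by Gauss-Jordan elimination with exact fractions.

    Assumes G is square and non-singular.
    """
    n = len(G)
    A = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(G, beta)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(n):
            if r != col:
                A[r] = [vr - A[r][col] * vc for vr, vc in zip(A[r], A[col])]
    return [row[n] for row in A]

# Gram matrix of S = {1, t, t^2} in L2[0, 1].
G = [[Fraction(1, i + j + 1) for j in range(3)] for i in range(3)]

# beta_i = <x, s_i> for x(t) = 2 + t + t^2, obtained here as G times the known a.
a_true = [2, 1, 1]
beta = [sum(G[i][j] * a_true[j] for j in range(3)) for i in range(3)]

print(beta)            # the inner products 17/6, 19/12, 67/60
print(solve(G, beta))  # recovers the coefficients 2, 1, 1
```

Note that $\beta \neq a$ here, in contrast with the orthonormal case where $G = I$.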


Definition 8 A singular matrix is a non-invertible square matrix. A non-singular 
matrix is an invertible square matrix. 


Theorem 1 A matrix $G$ is singular if $G \cdot x = 0$ for some $x \neq 0$. A matrix $G$ is
non-singular if $G \cdot x = 0$ only for $x = 0$.


The link between this notion of singularity and invertibility is straightforward: if
$G$ is singular, then there is some $a \neq 0$ for which $G \cdot a = 0$. Consider the
mapping $y = G \cdot x$; we would also have $y = G(x + a)$. Since $x \neq x + a$, one
cannot “invert” the mapping provided by $G$ into $y$.


Theorem 2 $S$ is linearly independent if and only if $G$ is non-singular (i.e., $G x = 0$
if and only if $x = 0$).


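Proof: We will prove an equivalent statement: $S$ is linearly dependent if and only
if $G$ is singular, i.e., if and only if there exists a vector $x \neq 0$ such that $G x = 0$.


($\Rightarrow$) We first prove that if $S$ is linearly dependent then $G$ is singular. In this case
there exists a set $\{a_i\} \subset R$, with at least one nonzero element, such that $\sum_{i=1}^n a_i s_i = 0$.
We then can write $\left\langle \sum_{i=1}^n a_i s_i, s_j \right\rangle = \langle 0, s_j \rangle = 0$ for each $s_j$. Linearity allows us
to take the sum and the scalar outside the inner product:
Equation:
$$\sum_{i=1}^n a_i \langle s_i, s_j \rangle = 0.$$


We can rewrite this equation in terms of the entries of the Gram matrix as
$\sum_{i=1}^n a_i G_{j,i} = 0$. This sum, in turn, can be written as the vector inner product
Equation:
$$\begin{bmatrix} G_{j,1} & \cdots & G_{j,n} \end{bmatrix} \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} = 0,$$


which is true for every value of $j$. We can therefore collect these equations into a
matrix-vector product:
Equation:
$$\begin{bmatrix} G_{1,1} & \cdots & G_{1,n} \\ \vdots & \ddots & \vdots \\ G_{n,1} & \cdots & G_{n,n} \end{bmatrix} \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}.$$
Therefore we have found a nonzero vector $a$ for which $G a = 0$, and therefore $G$
is singular.

($\Leftarrow$) Since all statements here are equalities, we can backtrack to prove the
opposite direction of the theorem: if $G a = 0$ for some $a \neq 0$, then the vector
$v = \sum_{i=1}^n a_i s_i$ satisfies $\langle v, s_j \rangle = 0$ for every $j$, so that
$\langle v, v \rangle = \sum_{j=1}^n \overline{a_j} \langle v, s_j \rangle = 0$ and hence $v = 0$, which means that $S$ is linearly dependent.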


Pythagorean Theorem 


There are still more nice properties for orthogonal sets of vectors. The next one
has well-known geometric applications.


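Theorem 3 (Pythagorean theorem) If $x$ and $y$ are orthogonal ($x \perp y$), then
$\|x\|^2 + \|y\|^2 = \|x + y\|^2$.


Proof:
Equation:
$$\|x + y\|^2 = \langle x + y, x + y \rangle = \langle x, x \rangle + \langle x, y \rangle + \langle y, x \rangle + \langle y, y \rangle.$$


Because $x$ and $y$ are orthogonal, $\langle x, y \rangle = \langle y, x \rangle = 0$ and we are left with
$\langle x, x \rangle = \|x\|^2$ and $\langle y, y \rangle = \|y\|^2$. Thus: $\|x + y\|^2 = \|x\|^2 + \|y\|^2$.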


Gram-Schmidt Process 
Description of the Gram-Schmidt procedure to formulate orthonormal bases. 


The Gram-Schmidt algorithm or procedure is used to find an orthonormal basis for the subspace 
[S], even if S is not linearly independent. The algorithm is formally defined as follows: 


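• Inputs: Set of vectors $S = \{s_1, \ldots, s_n\}$.
• Outputs: Orthonormal basis elements $W = \{w_1, \ldots, w_m\}$, with $m \le n$, that span the same space:
$[W] = [S]$.


• Procedure:


1. Take the first element of the set and divide it by its norm (that is, normalize the
element):
Equation:
$$w_1 = \frac{s_1}{\|s_1\|}.$$
2. Take the second element and subtract its projection onto the first basis element:
Equation:
$$t_2 = s_2 - \langle s_2, w_1 \rangle w_1.$$


Normalize the result $t_2$:
Equation:
$$w_2 = \frac{t_2}{\|t_2\|}.$$


It is easy to check that $w_1$ and $w_2$ are orthogonal:
Equation:
$$\langle w_1, w_2 \rangle = \left\langle w_1, \frac{t_2}{\|t_2\|} \right\rangle = \frac{1}{\|t_2\|} \langle w_1, s_2 - \langle s_2, w_1 \rangle w_1 \rangle$$
$$= \frac{1}{\|t_2\|} \left( \langle w_1, s_2 \rangle - \overline{\langle s_2, w_1 \rangle} \langle w_1, w_1 \rangle \right) = \frac{1}{\|t_2\|} \left( \langle w_1, s_2 \rangle - \overline{\langle s_2, w_1 \rangle} \right)$$
$$= \frac{1}{\|t_2\|} \cdot 0 = 0.$$


Thus, since both have unit norm, $w_1$ and $w_2$ are orthonormal.

3. The second step is repeated for each additional element; the $i^{th}$ element follows the
following formula:
Equation:
$$t_i = s_i - \sum_{j=1}^{i-1} \langle s_i, w_j \rangle w_j, \qquad w_i = \frac{t_i}{\|t_i\|}.$$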

When the set $S$ includes linearly dependent vectors, some of the unnormalized vectors $t_i = 0$, as
the projections will cancel out with some elements of $S$. These zero vectors are discarded, so the
number of basis elements obtained is smaller than the number of input vectors; in other words, the dimensionality of
$[S]$ will be smaller than the cardinality of the set $|S|$.
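The procedure can be sketched for real vectors in $\mathbb{R}^n$ as follows. This is an illustrative version under stated assumptions: the function name `gram_schmidt`, the tolerance `tol` used to decide that $t_i = 0$ in floating point, and the example set are all choices made for this sketch.

```python
import math

def dot(x, y):
    """Inner product on R^n."""
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt(S, tol=1e-12):
    """Return an orthonormal list W with span(W) = span(S), for real vectors."""
    W = []
    for s in S:
        t = list(s)
        for w in W:
            c = dot(s, w)                             # projection coefficient <s, w>
            t = [ti - c * wi for ti, wi in zip(t, w)]
        n = math.sqrt(dot(t, t))
        if n > tol:                                   # discard t = 0 (dependent input)
            W.append([ti / n for ti in t])
    return W

# The second vector is twice the first, so S is linearly dependent.
S = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [1.0, 0.0, 1.0]]
W = gram_schmidt(S)
print(len(W))                 # 2 basis vectors for 3 inputs
print(dot(W[0], W[1]))        # approximately 0
```

In floating point the discarded vector is only approximately zero, which is why the sketch compares its norm against a small tolerance rather than testing for exact equality.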


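Example 1 Let $S = \{1, t, t^2\}$, where $s_1 = 1$, $s_2 = t$ and $s_3 = t^2$. We can therefore write the
set of quadratic functions as $Q = [S]$, and $S \subset C(T)$ with $T = [0, 1]$. Recall that for this
space, the inner product is written as $\langle x, y \rangle = \int_0^1 x(t) y(t)\,dt$. We obtain a basis for $Q$ using the
Gram-Schmidt procedure: it requires us to compute some norms and inner products on the
way.


1. Solve for $w_1$: $\|s_1\| = \sqrt{\langle s_1, s_1 \rangle} = \sqrt{\int_0^1 1 \cdot 1\,dt} = 1$, and so $w_1(t) = \frac{s_1(t)}{\|s_1\|} = 1$.


2. Solve for $w_2$:
Equation:
$$\langle s_2, w_1 \rangle = \int_0^1 t \cdot 1\,dt = \left. \frac{t^2}{2} \right|_{t=0}^{1} = \frac{1}{2},$$
$$t_2(t) = s_2(t) - \langle s_2, w_1 \rangle w_1(t) = t - \frac{1}{2},$$
$$\|t_2\| = \sqrt{\langle t_2, t_2 \rangle} = \sqrt{\int_0^1 \left( t - \frac{1}{2} \right)^2 dt} = \sqrt{\int_0^1 \left( t^2 - t + \frac{1}{4} \right) dt} = \sqrt{\left. \left( \frac{t^3}{3} - \frac{t^2}{2} + \frac{t}{4} \right) \right|_{t=0}^{1}} = \sqrt{\frac{1}{12}},$$
$$w_2(t) = \frac{t_2(t)}{\|t_2\|} = \sqrt{12} \left( t - \frac{1}{2} \right) = 2\sqrt{3}\,t - \sqrt{3}.$$


It is easy to check that $w_2$ has unit norm and is orthogonal to $w_1$.
3. Solve for $w_3$:
Equation:
$$\langle s_3, w_1 \rangle = \int_0^1 t^2 \cdot 1\,dt = \left. \frac{t^3}{3} \right|_{t=0}^{1} = \frac{1}{3},$$
$$\langle s_3, w_2 \rangle = \int_0^1 t^2 \left( 2\sqrt{3}\,t - \sqrt{3} \right) dt = \frac{\sqrt{3}}{2} - \frac{\sqrt{3}}{3} = \frac{\sqrt{3}}{6},$$
$$t_3 = s_3 - \langle s_3, w_1 \rangle w_1 - \langle s_3, w_2 \rangle w_2 = t^2 - \frac{1}{3} - \frac{\sqrt{3}}{6} \left( 2\sqrt{3}\,t - \sqrt{3} \right) = t^2 - t + \frac{1}{6},$$
$$\|t_3\| = \sqrt{\langle t_3, t_3 \rangle} = \sqrt{\int_0^1 \left( t^2 - t + \frac{1}{6} \right)^2 dt},$$
$$w_3 = \frac{t_3}{\|t_3\|}.$$


We can check that $[W] = Q$.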
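The intermediate results of this example can be cross-checked with exact arithmetic. The sketch below is a variant of the procedure that skips the normalization step, so that all quantities stay rational; projecting onto an unnormalized $t_j$ with coefficient $\langle s, t_j \rangle / \langle t_j, t_j \rangle$ gives the same vector as projecting onto $w_j$. The helper names and the coefficient-list representation are assumptions of this illustration.

```python
from fractions import Fraction

def inner(p, q):
    """<p, q> = integral over [0, 1] of p(t) q(t) dt, for coefficient lists."""
    return sum(Fraction(pi) * Fraction(qj) / (i + j + 1)
               for i, pi in enumerate(p) for j, qj in enumerate(q))

def orthogonalize(S):
    """Gram-Schmidt without the normalization step (keeps arithmetic exact)."""
    T = []
    for s in S:
        t = [Fraction(c) for c in s]
        for u in T:
            c = inner(s, u) / inner(u, u)     # projection coefficient onto u
            t = [ti - c * ui for ti, ui in zip(t, u)]
        T.append(t)
    return T

# s1 = 1, s2 = t, s3 = t^2 as coefficient lists [c0, c1, c2].
S = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t1, t2, t3 = orthogonalize(S)

print(t2, inner(t2, t2))   # t - 1/2, with squared norm 1/12
print(t3, inner(t3, t3))   # t^2 - t + 1/6, with squared norm 1/180
```

Dividing each $t_i$ by the square root of its squared norm then yields the orthonormal elements $w_i$.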


Projection Theorem 
Introduces the concept of subspace projection and the optimality given by the projection theorem. 


Uniqueness of Decompositions 

For the next property, we need two quick definitions. 

Definition 1 The sum of two spaces $S_1, S_2 \subseteq X$ is the set $S_1 + S_2 = \{x + y : x \in S_1, y \in S_2\}$.

Definition 2 The subspaces $S_1, S_2 \subseteq X$ are disjoint if $S_1 \cap S_2 = \{0\}$.


Theorem 1 Let $V, W \subseteq X$ be subspaces. For each $x \in V + W$ there exists a unique pair $v \in V$, $w \in W$
such that $x = v + w$ if and only if $V$ and $W$ are disjoint.


Proof: We prove the two directions of this “if and only if” statement separately. 


($\Rightarrow$) If for each $x \in V + W$ there exists a unique pair $(v, w) \in V \times W$ such that $x = v + w$, then the subspaces $V$
and $W$ are disjoint.


Assume there exists a unique pair $v, w$ s.t. $x = v + w$ for each $x \in V + W$. For the sake of contradiction,
assume $V$ and $W$ are not disjoint, i.e., there exists some $z \in V \cap W$ such that $z \neq 0$. Pick $x$ and the
corresponding $v, w$ such that $x = v + w$. Then,
Equation:
$$x = v + w + z - z = (v + z) + (w - z).$$


We examine these two terms in the equation:
Equation:
$$v + z: \quad v \in V,\ z \in V \Rightarrow v + z \in V,$$
Equation:
$$w - z: \quad w \in W,\ z \in W \Rightarrow w - z \in W.$$
Also note
Equation:
$$v + z \neq v \quad \text{and} \quad w - z \neq w \quad \text{since } z \neq 0.$$


So we have two pairs of elements from $V$ and $W$ which sum to $x$, a contradiction to the assumption that these
pairs are unique. Therefore, $V \cap W = \{0\}$ and thus $V$ and $W$ are disjoint.


($\Leftarrow$) If $V$ and $W$ are disjoint, then for each $x \in V + W$ there exists a unique pair $(v, w) \in V \times W$ such that
$v + w = x$.


To prove uniqueness, we will assume that two distinct pairs exist and show at the end that they are equal to
each other. The two pairs we begin with are
Equation:
$$x = v + w, \quad v \in V \text{ and } w \in W,$$
Equation:
$$x = p + q, \quad p \in V \text{ and } q \in W,$$
Equation:
$$p \neq v \text{ or } q \neq w,$$
so the pairs are distinct from each other. Subtract the equations:
Equation:
$$0 = p - v + q - w,$$
$$w - q = p - v.$$

We examine these two terms in the equation:
Equation:
$$p - v: \quad p \in V,\ v \in V \Rightarrow p - v \in V,$$
Equation:
$$w - q: \quad w \in W,\ q \in W \Rightarrow w - q \in W.$$


Since those two terms are equal,
Equation:
$$w - q \in V \quad \text{and} \quad p - v \in W,$$


which means that $w - q \in V \cap W$ and $p - v \in V \cap W$. Our starting assumption was that $V \cap W = \{0\}$, so
therefore $w - q = 0$ and $p - v = 0$, which implies $w = q$ and $p = v$. This statement is a contradiction since
we assumed that the pairs were distinct, i.e., $w \neq q$ or $p \neq v$. Therefore, only a unique pair $v \in V$, $w \in W$
exists such that $v + w = x$.


Fact 1 If $S$ is a subspace, then $S$ and $S^\perp$ are disjoint. To see this, assume $x \in S \cap S^\perp$, which means $x \in S^\perp$
and so $\langle x, y \rangle = 0$ for all $y \in S$. Pick $y = x$ so $\langle x, x \rangle = 0 \Rightarrow x = 0$, which means $S \cap S^\perp = \{0\}$.


Fact 2 Using Fact 1, we can show that for each $x \in X$ there exists a unique pair $s \in S$ and $t \in S^\perp$ such that
$x = s + t$. In particular:


• If $x \in S$, $s = x$ and $t = 0$.
• If $x \in S^\perp$, $s = 0$ and $t = x$.
• If $x \notin S$ and $x \notin S^\perp$, both $s, t \neq 0$.

Projection Theorem 


In fact, there is quite a lot more we can say about the projection of a point $x \in X$ into a subspace $S \subseteq X$.


Theorem 2 Let $X$ be a Hilbert space and $S$ be a closed subspace of $X$. For any vector $x \in X$ there exists a
unique point $s_0 \in S$ that is closest to $x$, i.e.,
Equation:
$$s_0 = \arg\min_{s \in S} \|x - s\|.$$


In other words,
$\|x - s_0\| \le \|x - s\|$ for all $s \in S$, with equality only if $s = s_0$.


Furthermore, $s_0$ is a minimizer of $\|x - s\|$ over $s \in S$ if and only if $x - s_0 \perp S$. In other words,
$x - s_0 \in S^\perp$.


Proof: To structure our proof, we will restate the theorem into four separate facts: 


1. Existence: A minimizer of the distance $\|x - s\|$ exists in $S$.
2. Orthogonality of the error: If $s_0$ is a minimizer, then $x - s_0 \perp S$.
3. Sufficiency of orthogonality: If $x - s_0 \perp S$, then $s_0$ is a minimizer of the distance function.
4. Uniqueness: Only one point that minimizes the distance exists in $S$.


Each of these facts is proven separately. 


1. Existence: If $x \in S$, then $s_0 = x$ is the minimizer (since 0 is the minimum distance). If $x \notin S$, we denote
$\delta = \inf_{s \in S} \|x - s\|$. Note here that we use the infimum, which is the greatest of all the lower bounds on
the distance. The infimum is used because we do not yet know if there exists a point in $S$ that has lowest
distance to $x$. Note also that an infimum exists because norms are bounded below by zero.


We need to show that for some $s_0 \in S$, we have $\|x - s_0\| = \delta$. Let $\{s_i\} \subseteq S$ be a sequence that yields
$\|x - s_i\| \to \delta$. We will show that this sequence is a Cauchy sequence; then, using the fact that $S$ is a
closed subset of a Hilbert space, it is complete and therefore the Cauchy sequence must converge to a point $s_0 \in S$.


To prove that the sequence is Cauchy, we use the Parallelogram Law:
Equation:
$$\|(s_j - x) + (x - s_i)\|^2 + \|(s_j - x) - (x - s_i)\|^2 = 2\left( \|s_j - x\|^2 + \|x - s_i\|^2 \right),$$
$$\|s_j - s_i\|^2 = 2\|s_j - x\|^2 + 2\|x - s_i\|^2 - \|s_j + s_i - 2x\|^2,$$
$$\|s_j - s_i\|^2 = 2\|s_j - x\|^2 + 2\|x - s_i\|^2 - 4\left\| x - \frac{s_i + s_j}{2} \right\|^2.$$


Since $S$ is a subspace, $\frac{s_i + s_j}{2} \in S$, so $\left\| x - \frac{s_i + s_j}{2} \right\|^2 \ge \delta^2$, since $\delta$ is the infimum. Therefore,
Equation:
$$\|s_j - s_i\|^2 \le 2\|x - s_j\|^2 + 2\|s_i - x\|^2 - 4\delta^2.$$
At this point, we note that both $\|x - s_j\|^2$ and $\|s_i - x\|^2$ converge to $\delta^2$, so the right-hand side goes to zero;
thus $\|s_j - s_i\| \to 0$ as $i, j \to \infty$, and so the sequence is a Cauchy
sequence. Since $\{s_i\}$ is Cauchy, it converges to a point $s_0$, and by the triangle inequality:
Equation:
$$\|x - s_0\| \le \|x - s_j\| + \|s_j - s_0\| \quad \text{for each value of } j = 1, 2, \ldots$$
Now the first term of the inequality goes to $\delta$ as $j \to \infty$ and the second term goes to 0, and so we have that
$\|x - s_0\| \le \delta + \epsilon$ for all values of $\epsilon > 0$. Since $\delta$ was the infimum of the norm on the left hand side, it
follows that $\|x - s_0\| = \delta$, and so we have found that a minimizer exists.


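2. Orthogonality of the error: We proceed by contradiction by assuming $x - s_0$ is not orthogonal to $S$, i.e.,
that there exists a unit-norm vector $\tilde{s} \in S$ such that $\langle x - s_0, \tilde{s} \rangle = \delta \neq 0$. Let $z = s_0 + \delta \tilde{s} \in S$. Then
$\|x - s_0\| \le \|x - z\|$. Furthermore,
Equation:
$$\|x - z\|^2 = \langle x - z, x - z \rangle = \langle x - s_0 - \delta \tilde{s}, x - s_0 - \delta \tilde{s} \rangle$$
$$= \langle x - s_0, x - s_0 \rangle + \langle \delta \tilde{s}, \delta \tilde{s} \rangle - 2\,\mathrm{Re}\left( \langle x - s_0, \delta \tilde{s} \rangle \right)$$
$$= \|x - s_0\|^2 + |\delta|^2 \|\tilde{s}\|^2 - 2\,\mathrm{Re}\left( \overline{\delta} \langle x - s_0, \tilde{s} \rangle \right).$$


Now we recognize two simplifications. First, we selected $\tilde{s}$ so that $\|\tilde{s}\|^2 = 1$; second, the remaining
term is equal to $-2\,\mathrm{Re}\left( \overline{\delta} \langle x - s_0, \tilde{s} \rangle \right) = -2\,\mathrm{Re}(\overline{\delta} \delta) = -2|\delta|^2$. Therefore,
Equation:
$$\|x - z\|^2 = \|x - s_0\|^2 - |\delta|^2 < \|x - s_0\|^2 \quad \text{since } \delta \neq 0.$$


This is a contradiction, since we have found a $z \in S$ closer to $x$ than $s_0$. Thus $\langle x - s_0, s \rangle = 0$ for all $s \in S$,
which means that $x - s_0 \perp S$.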


3. Sufficiency of orthogonality: Assume $x - s_0 \perp S$. For any $\tilde{s} \in S$ with $\tilde{s} \neq s_0$, we have
$\|x - \tilde{s}\|^2 = \|x - s_0 + s_0 - \tilde{s}\|^2$. We use the Pythagorean theorem (if $a \perp b$ then
$\|a + b\|^2 = \|a\|^2 + \|b\|^2$): since $x - s_0 \perp S$ and $s_0 - \tilde{s} \in S$, we have
$\|x - \tilde{s}\|^2 = \|x - s_0\|^2 + \|s_0 - \tilde{s}\|^2$. Since $s_0 \neq \tilde{s}$, we have $\|s_0 - \tilde{s}\| > 0$, and so $\|x - \tilde{s}\|^2 > \|x - s_0\|^2$.
Now, since we picked $\tilde{s} \in S$ arbitrarily, it follows that $s_0$ is the minimizer of $\|x - s\|$ over $s \in S$.


4. Uniqueness: We can write $x = x - s_0 + s_0$. From previous parts, we know that $x - s_0 \in S^\perp$ and $s_0 \in S$.
Since $S$ and $S^\perp$ are disjoint, only one pair of values $x - s_0$ and $s_0$ that holds these two properties exists.
Therefore, the minimizer $s_0$ is unique.
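The two characterizations of the minimizer (orthogonal error, smallest distance) can be observed numerically in $\mathbb{R}^3$. The following sketch is an illustration under stated assumptions: the plane $S$, its orthonormal basis `b1`, `b2`, and the point `x` are example data, and the candidate $s_0$ is built by summing the components of $x$ along the orthonormal basis vectors.

```python
import math
import random

def dot(x, y):
    """Inner product on R^n."""
    return sum(a * b for a, b in zip(x, y))

# Orthonormal basis of a plane S in R^3 (example data).
b1 = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]
b2 = [0.0, 0.0, 1.0]
x = [3.0, 1.0, 2.0]

# Candidate minimizer: s0 = <x, b1> b1 + <x, b2> b2.
c1, c2 = dot(x, b1), dot(x, b2)
s0 = [c1 * u + c2 * v for u, v in zip(b1, b2)]
err = [xi - si for xi, si in zip(x, s0)]

print(dot(err, b1), dot(err, b2))   # both approximately 0: x - s0 is orthogonal to S

# No other sampled point of S should be closer to x than s0.
random.seed(0)
d0 = math.dist(x, s0)
others = []
for _ in range(1000):
    a, b = random.uniform(-5, 5), random.uniform(-5, 5)
    s = [a * u + b * v for u, v in zip(b1, b2)]
    others.append(math.dist(x, s))
print(d0 <= min(others))
```

Random sampling cannot prove optimality, of course; it only fails to find a counterexample, which is consistent with the theorem.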


Least Squares Estimation in Hilbert Spaces 
Presents least squares estimation in subspaces of Hilbert spaces, with applications.


Projections with Orthonormal Bases 


Having an orthonormal basis for the subspace of interest significantly simplifies the projection 
operator. 


Lemma 1 Let $x \in X$, a Hilbert space, and let $S$ be a subspace of $X$. If $\{b_1, b_2, \ldots\}$ is an orthonormal
basis for $S$, then the closest point $s_0 \in S$ to $x$ is given by $s_0 = \sum_i \langle x, b_i \rangle b_i$.


We begin by noting that
Equation:
$$\sum_i \langle x, b_i \rangle b_i = \sum_i \langle x - s_0 + s_0, b_i \rangle b_i = \sum_i \langle x - s_0, b_i \rangle b_i + \sum_i \langle s_0, b_i \rangle b_i.$$


Now, since $s_0$ is the projection of $x$ onto $S$, we must have that $x - s_0 \perp S$, and so for each basis
element $b_i$ we must have $\langle x - s_0, b_i \rangle = 0$. Additionally, since $s_0 \in S$ and $\{b_1, b_2, \ldots\}$ is an
orthonormal basis for $S$, we must have that $\sum_i \langle s_0, b_i \rangle b_i = s_0$. Thus, we obtain
Equation:
$$\sum_i \langle x, b_i \rangle b_i = s_0,$$


proving the lemma.


Application: Communications Receiver 


Consider the case of a communications receiver that records a continuous-time signal
$r(t) = s(t) + n(t)$ over $0 \le t \le 1$, where $s(t)$ is one of $m$ codeword signals $\{s_1(t), \ldots, s_m(t)\}$, and
$n(t)$ is additive white Gaussian noise. The receiver must make the best possible decision on the
observed codeword given the reading $r(t)$; this usually involves removing as much of the noise as
possible from $r(t)$.


We analyze this problem in the context of the Hilbert space $L_2[0, 1]$. To remove as much of the noise
as possible, we define the subspace $S = \mathrm{span}(\{s_1(t), \ldots, s_m(t)\})$. Anything that is not contained in
this subspace is guaranteed to be part of the noise $n(t)$. Now, to obtain the projection into $S$, we need
to find an orthonormal basis $\{e_1(t), \ldots, e_n(t)\}$ for $S$, which can be done for example by applying the
Gram-Schmidt procedure on the vectors $\{s_1(t), \ldots, s_m(t)\}$. The projection is then obtained according
to the lemma as
Equation:
$$r_S(t) = \sum_{i=1}^n \langle r(t), e_i(t) \rangle e_i(t),$$


where $\langle r(t), e_i(t) \rangle = \int_0^1 r(t) e_i(t)\,dt$.


After the projection is obtained, an optimal receiver proceeds by finding the value of $k$ that minimizes
the squared distance
Equation:
$$d_2(r_S(t), s_k(t))^2 = \int_0^1 |r_S(t) - s_k(t)|^2\,dt = \int_0^1 r_S(t)^2\,dt + \int_0^1 s_k(t)^2\,dt - 2 \int_0^1 r_S(t) s_k(t)\,dt;$$


note here that the first term does not depend on $k$, so it suffices to find the value of $k$ that minimizes the
“cost”
Equation:
$$c_k = \int_0^1 s_k(t)^2\,dt - 2 \int_0^1 r_S(t) s_k(t)\,dt = \langle s_k(t), s_k(t) \rangle - 2 \langle r_S(t), s_k(t) \rangle.$$


In practice, the codeword signals are designed so that their norms $\|s_k(t)\|_2 = \sqrt{\langle s_k(t), s_k(t) \rangle}$ are
all equal. This design choice reduces the problem above to finding the value of $k$ that maximizes the
score
Equation:
$$\langle r_S(t), s_k(t) \rangle = \int_0^1 r_S(t) s_k(t)\,dt.$$

Thus, the receiver can be designed according to the diagram in [Link]. 


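[Figure: receiver block diagram. Recoverable labels: the basis signals up to $e_n(t)$; "Compute $c_k$ for each $k$; use precomputed values for $\|s_k(t)\|^2$, $\langle e_i(t), s_k(t) \rangle$"; "Output value of $k$ that minimizes $c_k$".]

Diagram of a communications receiver designed in accordance with the projection theorem.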
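A discrete-time stand-in for this receiver can be sketched as follows. Everything concrete here is an assumption made for illustration: signals are represented by four samples, the two codewords are hypothetical equal-energy vectors that happen to be orthogonal (so normalizing them is all that the Gram-Schmidt procedure would do), and the received vector is codeword 1 plus a fixed perturbation standing in for noise.

```python
import math

def dot(x, y):
    """Discrete stand-in for the L2 inner product."""
    return sum(a * b for a, b in zip(x, y))

# Two equal-energy codewords sampled at 4 points (hypothetical example data).
codewords = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]

# Orthonormal basis e_i for S = span(codewords).
basis = [[v / math.sqrt(dot(s, s)) for v in s] for s in codewords]

def decide(r):
    """Project r onto S, then return the codeword index with the largest score."""
    coeffs = [dot(r, e) for e in basis]
    r_S = [sum(c * e[n] for c, e in zip(coeffs, basis)) for n in range(len(r))]
    scores = [dot(r_S, s) for s in codewords]
    return scores.index(max(scores))

# Received signal: codeword 1 plus a small perturbation.
r = [0.8, -1.3, 1.1, -0.7]
print(decide(r))   # 1
```

Because the codewords have equal norms, maximizing the score $\langle r_S, s_k \rangle$ selects the same index as minimizing the cost $c_k$.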


Least Squares Approximation in Hilbert Spaces 


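Let $y_1, \ldots, y_n$ be elements of a Hilbert space $X$ and define the closed, finite-dimensional subspace of
$X$ given by $S = \mathrm{span}(y_1, \ldots, y_n)$. We wish to find the best approximation of $x$ in terms of the vectors
$y_i$, that is, the linear combination $\sum_{i=1}^n a_i y_i$ with the smallest error $e = x - \sum_{i=1}^n a_i y_i$. To measure
the size of the error, we use the induced norm $\|e\| = \left\| x - \sum_{i=1}^n a_i y_i \right\|$.


To solve this problem, we rely on the projection theorem: we are indeed looking for the closest point to
$x$ in $S = \mathrm{span}(y_1, \ldots, y_n)$. The projection theorem tells us that the closest point $s_0 = \sum_{i=1}^n a_i y_i$ must
give $x - s_0 \perp S$, i.e., $e \perp S$, which implies in turn that $\left\langle x - \sum_{i=1}^n a_i y_i, y_j \right\rangle = 0$ for all $j = 1, \ldots, n$.
The requirement can be rewritten as $\langle x, y_j \rangle = \left\langle \sum_{i=1}^n a_i y_i, y_j \right\rangle = \sum_{i=1}^n a_i \langle y_i, y_j \rangle$ for each
$j = 1, \ldots, n$. These requirements can be collected and written in matrix form as
Equation:
$$\underbrace{\begin{bmatrix} \langle x, y_1 \rangle \\ \langle x, y_2 \rangle \\ \vdots \\ \langle x, y_n \rangle \end{bmatrix}}_{\beta} = \underbrace{\begin{bmatrix} \langle y_1, y_1 \rangle & \langle y_2, y_1 \rangle & \cdots & \langle y_n, y_1 \rangle \\ \langle y_1, y_2 \rangle & \langle y_2, y_2 \rangle & \cdots & \langle y_n, y_2 \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle y_1, y_n \rangle & \langle y_2, y_n \rangle & \cdots & \langle y_n, y_n \rangle \end{bmatrix}}_{G} \underbrace{\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}}_{a}.$$


The coefficients of the best approximation can then be obtained as a vector $a = G^{-1} \beta$, as long as the
Gram matrix $G$ is invertible, i.e., it has a nonzero determinant.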


In the case that $x$ and $y_i$ are complex-valued vectors, one can rewrite the approximation
$\sum_{i=1}^n a_i y_i = Ma$, where $a$ is the coefficient vector denoted above and $M = [y_1\ \ldots\ y_n]$ is a matrix that
collects the vectors $y_i$ as its columns. The projection theorem requirement then becomes
$\langle x - Ma, y_j\rangle = 0$ for $j = 1, \ldots, n$, which can be rewritten as $y_j^H(x - Ma) = 0$ and collected as
before into the matrix equation

Equation:

$$\begin{aligned} M^H(x - Ma) &= 0, \\ M^H x - M^H M a &= 0, \\ M^H M a &= M^H x, \\ a &= (M^H M)^{-1} M^H x = M^\dagger x, \end{aligned}$$

which is known as the least squares solution and exists as long as $M^H M$ is an invertible matrix. Once
these coefficients are obtained, the approximation is equal to $\hat{x} = Ma = M(M^H M)^{-1}M^H x$;
therefore, the matrix $P = M(M^H M)^{-1}M^H$ is known as the projection operator.
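The normal equations and the projection operator can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical random basis of three vectors in $\mathbb{R}^5$ (the data, the seed, and the variable names are illustrative assumptions, not part of the original notes).

```python
import numpy as np

# Hypothetical data: three basis vectors y_1, y_2, y_3 in R^5 (columns of M)
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))
x = rng.standard_normal(5)

# Gram matrix G[j, i] = <y_i, y_j> and right-hand side beta[j] = <x, y_j>
G = M.conj().T @ M
beta = M.conj().T @ x

# Solve the normal equations G a = beta (avoids forming G^{-1} explicitly)
a = np.linalg.solve(G, beta)

# Best approximation, its residual, and the projection operator P = M (M^H M)^{-1} M^H
x_hat = M @ a
residual = x - x_hat
P = M @ np.linalg.solve(G, M.conj().T)
```

Under these assumptions, `P @ x` reproduces `x_hat`, the residual is orthogonal to every column of `M`, and `P @ P` equals `P`, as expected of an orthogonal projection.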


Application: Channel Equalization 


We consider a linear channel with impulse response $h$ that maps an input $x$ into an output $y$:
Equation:

$$x \to h \to y.$$

We wish to design a linear equalizer with impulse response $g$ for the input $x$ so that, after the signal is run through
the equalizer and the channel with impulse response $h$, the output satisfies $\hat{x} \approx x$ (i.e., $\hat{x}$ is as close as possible to $x$):
Equation:

$$x \to g \to h \to \hat{x}.$$

Since the equalizer and the channel are both linear time-invariant systems, the order of $g$ and $h$ can be reversed (this will be discussed in more detail
later):
Equation:

$$x \to h \to g \to \hat{x}.$$


Our design for $g$ will be a finite impulse response filter with tap coefficients $g_1, \ldots, g_k$; the mapping from input
to output at index $n$ is therefore given by
Equation:

$$\hat{x}_n = \sum_{i=1}^{k} g_i y_{n-i}.$$

The error in approximating $x_n$ is given by
Equation:

$$e_n = \hat{x}_n - x_n = \sum_{i=1}^{k} g_i y_{n-i} - x_n.$$

The total error magnitude over $N$ observations is given by
Equation:

$$E = \sum_{n=0}^{N} e_n^2 = \sum_{n=0}^{N}\left(\sum_{i=1}^{k} g_i y_{n-i} - x_n\right)^2.$$


We want to pose this question in terms of error of approximation into a subspace:
Equation:

$$E = \|Mg - b\|_2^2.$$

By convention, we assume that the values $y_n = 0$ and $x_n = 0$ for $n < 0$ (i.e., $n = 0$ is the time of the
first observation). It can be easily seen that
Equation:

$$g = \begin{bmatrix} g_1 \\ \vdots \\ g_k \end{bmatrix}, \qquad b = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix};$$


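formulating $M$ requires a separate study of the sum in [link] for each value of $n$. For $n = 0$,
Equation:

$$\sum_{i=1}^{k} g_i y_{n-i} = \sum_{i=1}^{k} g_i y_{-i} = 0,$$

and so the terms with $n \le 0$ do not depend on $g$ and can be ignored. For $n = 1$,
Equation:

$$\sum_{i=1}^{k} g_i y_{n-i} = \sum_{i=1}^{k} g_i y_{1-i} = [y_0, y_{-1}, \ldots, y_{-k+1}] \begin{bmatrix} g_1 \\ \vdots \\ g_k \end{bmatrix} = [y_0, 0, \ldots, 0] \begin{bmatrix} g_1 \\ \vdots \\ g_k \end{bmatrix}.$$

Similarly, for $n = 2$,
Equation:

$$\sum_{i=1}^{k} g_i y_{n-i} = \sum_{i=1}^{k} g_i y_{2-i} = [y_1, y_0, 0, \ldots, 0] \begin{bmatrix} g_1 \\ \vdots \\ g_k \end{bmatrix}.$$

Continuing until $n = N$,
Equation:

$$\sum_{i=1}^{k} g_i y_{n-i} = \sum_{i=1}^{k} g_i y_{N-i} = [y_{N-1}, y_{N-2}, \ldots, y_{N-k}] \begin{bmatrix} g_1 \\ \vdots \\ g_k \end{bmatrix}.$$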


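The concatenation of these sums as a vector can then be expressed by the matrix-vector product $Mg$,
where the $N \times k$ matrix $M$ is given by

Equation:

$$M = \begin{bmatrix} y_0 & 0 & 0 & 0 & \cdots & 0 \\ y_1 & y_0 & 0 & 0 & \cdots & 0 \\ y_2 & y_1 & y_0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ y_{N-1} & y_{N-2} & y_{N-3} & y_{N-4} & \cdots & y_{N-k} \end{bmatrix}.$$

Note that for $M$ to have linearly independent columns (a condition for uniqueness of the solution $g$)
we need at least as many observations as taps, $N \ge k$; for example, $N \ge k$ together with $y_0 \neq 0$ is sufficient, since the first $k$ rows of $M$ then form a lower-triangular matrix with nonzero diagonal. In this case, the solution
Equation:

$$g = M^\dagger b = (M^T M)^{-1} M^T b$$

minimizes the error as established in the Projection Theorem.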
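The construction of $M$ and the least squares solve can be sketched as follows. This is a minimal illustration, assuming NumPy; the channel taps `h`, the random input, the sizes `N` and `k`, and the helper name `convolution_matrix` are all hypothetical choices made for the example. It shows only the mechanics of the formulation and makes no claim about how well this particular equalizer performs.

```python
import numpy as np

def convolution_matrix(y, k):
    """N x k matrix whose row for time n is [y_{n-1}, ..., y_{n-k}], with y_m = 0 for m < 0."""
    N = len(y)
    M = np.zeros((N, k))
    for col in range(k):
        # Column i = col + 1 holds y delayed by col samples
        M[col:, col] = y[:N - col]
    return M

# Hypothetical channel and input, chosen only for illustration
h = np.array([1.0, 0.5])
rng = np.random.default_rng(1)
N, k = 200, 8
x = rng.standard_normal(N + 1)        # x_0, ..., x_N
y = np.convolve(x, h)[:N]             # y_0, ..., y_{N-1}

M = convolution_matrix(y, k)
b = x[1:N + 1]                        # targets x_1, ..., x_N

# Least squares tap coefficients and the resulting total squared error
g, *_ = np.linalg.lstsq(M, b, rcond=None)
E = np.sum((M @ g - b) ** 2)
```

`np.linalg.lstsq` is used instead of forming $(M^T M)^{-1}$ explicitly, which is the usual numerically safer choice; the residual `M @ g - b` is orthogonal to the columns of `M`, as the projection theorem requires.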


Application: Linear Regression 


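In linear regression, we are given a set of input/output pairs $(x_i, y_i)$, $i = 1, \ldots, N$, and we wish to find
a linear relationship between inputs and outputs $y_i = ax_i + b$ that minimizes the sum of squared errors
$E = \sum_{i=1}^{N} (y_i - (ax_i + b))^2$. As in previous examples, we seek to pose this minimization problem
in terms of the problem considered by the projection theorem: the error $E = \|Mg - y\|_2^2$, where $M$ is
a matrix, $y$ is a vector, and $g$ is the optimization variable vector. One can easily see that the following
choice achieves the desired equality:

Equation:

$$M = \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix}, \qquad g = \begin{bmatrix} a \\ b \end{bmatrix}, \qquad \text{and} \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}.$$

As before, the solution that minimizes the error is given by
Equation:

$$g = (M^T M)^{-1} M^T y,$$

which exists and is unique as long as $G = M^T M$ is invertible, i.e., as long as $M$ has linearly
independent columns, i.e., as long as not all $x_i$ are equal. Now, we see that

Equation:

$$M^T M = \begin{bmatrix} x_1 & \cdots & x_N \\ 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^N x_i^2 & \sum_{i=1}^N x_i \\ \sum_{i=1}^N x_i & N \end{bmatrix},$$

$$(M^T M)^{-1} = \frac{1}{N\sum_{i=1}^N x_i^2 - \left(\sum_{i=1}^N x_i\right)^2} \begin{bmatrix} N & -\sum_{i=1}^N x_i \\ -\sum_{i=1}^N x_i & \sum_{i=1}^N x_i^2 \end{bmatrix}.$$

Collecting these results, we have that
Equation:

$$\begin{bmatrix} a \\ b \end{bmatrix} = \frac{1}{N\sum_{i=1}^N x_i^2 - \left(\sum_{i=1}^N x_i\right)^2} \begin{bmatrix} N\sum_{i=1}^N x_i y_i - \left(\sum_{i=1}^N x_i\right)\left(\sum_{i=1}^N y_i\right) \\ \left(\sum_{i=1}^N x_i^2\right)\left(\sum_{i=1}^N y_i\right) - \left(\sum_{i=1}^N x_i\right)\left(\sum_{i=1}^N x_i y_i\right) \end{bmatrix}.$$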
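As a sanity check, the closed-form slope and intercept can be compared against the matrix formulation. The sketch below assumes NumPy and uses a small hypothetical data set invented for illustration.

```python
import numpy as np

# Hypothetical data points, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
N = len(x)

# Closed-form coefficients obtained from the normal equations
den = N * np.sum(x**2) - np.sum(x)**2
a = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / den
b = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / den

# Same fit through the matrix formulation g = (M^T M)^{-1} M^T y
M = np.column_stack([x, np.ones(N)])
g = np.linalg.solve(M.T @ M, M.T @ y)
```

Both routes give the same pair `(a, b)`, which also agrees with `np.polyfit(x, y, 1)`.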


Least Squares: Rejoinder 


We have studied several examples where an optimization problem can be formulated as
Equation:

$$\hat{g} = \arg\min_g \|Mg - b\|_2^2,$$

where $M$ is a matrix and $b$ and $g$ are column vectors of appropriate sizes.

• When the columns of $M$ are orthonormal (i.e., orthogonal and unit norm), we compute $g = M^H b$,
i.e., $g_i = \langle b, M_i\rangle$, where $g_i$ is the $i$th entry of $g$ and $M_i$ is the $i$th column of $M$.
• When the columns of $M$ are linearly independent, we compute $g = M^\dagger b = (M^H M)^{-1} M^H b$.
The Moore-Penrose pseudoinverse $M^\dagger = (M^H M)^{-1} M^H$ is well defined because the Gram
matrix $G = M^H M$ of a linearly independent set is invertible.
• When the columns of $M$ are linearly dependent, there exists a vector $a \neq 0$ such that $Ma = 0$.
Thus, if a solution $g_0$ to the optimization exists, then $g_0 + a$ is also a solution: note that
$\|Mg_0 - b\| = \|M(g_0 + a) - b\|$, and so we lose uniqueness of the minimizer; in fact, we will
have infinitely many solutions to the minimization. However, there are ways to rank the solutions
and pick a "favorite", e.g., the solution with the smallest norm. This will be considered later in the
course.
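The loss of uniqueness in the linearly dependent case can be seen in a small numerical sketch. It assumes NumPy; the matrix, the vector `b`, and the null-space vector `a` are hypothetical, and `np.linalg.pinv` is used here as one way to obtain the smallest-norm minimizer mentioned above.

```python
import numpy as np

# Hypothetical matrix whose third column is the sum of the first two
M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

a = np.array([1.0, 1.0, -1.0])     # nonzero vector with M a = 0
g0 = np.linalg.pinv(M) @ b         # minimum-norm minimizer of ||M g - b||
g1 = g0 + a                        # a different minimizer with the same error

err0 = np.linalg.norm(M @ g0 - b)
err1 = np.linalg.norm(M @ g1 - b)
```

Here `err0` and `err1` coincide, while `g0` has strictly smaller norm than `g1`.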


The Hilbert Space of Random Variables 
Describes random variables in terms of Hilbert spaces, defining inner products, norms, 
and minimum mean square error estimation. 


Random Variable Spaces 


Probability — Notation Primer 


Definition 1 A random variable $X$ is defined by a distribution function
Equation:

$$P(x) = F_X(x) = \mathrm{Prob}(X \le x).$$

The density function is given by
Equation:

$$f(x) = f_X(x) = \frac{dP(x)}{dx}.$$

Definition 2 The expectation of a function $g(x)$ over the random variable $X$ is
Equation:

$$E_X[g(x)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.$$

Definition 3 Pairs of random variables $X$, $Y$ are defined by the joint distribution
function
Equation:

$$P(x, y) = F_{XY}(x, y) = \mathrm{Prob}(X \le x, Y \le y).$$

The joint density function is given by
Equation:

$$f(x, y) = f_{XY}(x, y) = \frac{\partial^2 P(x, y)}{\partial x\,\partial y} = \frac{\partial^2\, \mathrm{Prob}(X \le x, Y \le y)}{\partial x\,\partial y}.$$

The expectation of a function $g(x, y)$ is given by

Equation:

$$E_{XY}[g(x, y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\,dx\,dy.$$


A Hilbert Space of Random Variables 


Definition 4 Let $\{Y_1, \ldots, Y_n\}$ be a collection of zero-mean ($E[Y_i] = 0$) random
variables. The space $H$ of all random variables that are linear combinations of those $n$
random variables $\{Y_1, \ldots, Y_n\}$ is a Hilbert space with inner product

Equation:

$$\langle X, Y\rangle = E[XY] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f_{XY}(x, y)\,dx\,dy.$$

We can easily check that this is a valid inner product:

• $\langle X, X\rangle = \int_{-\infty}^{\infty} |x|^2 f_X(x)\,dx = E[|X|^2] \ge 0$;
• $\langle X, X\rangle = 0$ if and only if $f_X(x) = \delta(x)$, i.e., if $X$ is a random variable that is
deterministically zero (and this random variable is the "zero" of this Hilbert
space);
• $\langle X, Y\rangle = \langle Y, X\rangle$;
• $\langle X + Y, Z\rangle = E[(X + Y)Z] = E[XZ] + E[YZ] = \langle X, Z\rangle + \langle Y, Z\rangle$.

Note in particular that orthogonality, i.e., $\langle X, Y\rangle = 0$, means $E[XY] = 0$, i.e., $X$ and $Y$
are uncorrelated random variables. Independent zero-mean random variables are therefore orthogonal, but orthogonal random variables need not be independent. Additionally, the induced norm
$\|X\| = \sqrt{\langle X, X\rangle} = \sqrt{E[|X|^2]}$ is the standard deviation of the zero-mean random
variable $X$.
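The distinction between orthogonal (uncorrelated) and independent random variables can be made concrete with a small exact computation. The sketch below uses only the Python standard library; the discrete distribution and the choice $Y = X^2 - 2/3$ are hypothetical and were picked so that every expectation can be evaluated exactly with fractions.

```python
from fractions import Fraction

# Hypothetical discrete example: X is uniform on {-1, 0, 1} and Y = X^2 - 2/3
values = [-1, 0, 1]
prob = Fraction(1, 3)

def expect(g):
    """Expectation of g(X) for X uniform on `values`."""
    return sum(prob * g(x) for x in values)

def Y(x):
    return x * x - Fraction(2, 3)

mean_X = expect(lambda x: x)               # zero-mean check for X
mean_Y = expect(Y)                         # zero-mean check for Y
inner_XY = expect(lambda x: x * Y(x))      # <X, Y> = E[XY]
norm_X = expect(lambda x: x * x) ** 0.5    # ||X|| = sqrt(E[X^2])

# Dependence check: P(X = 0, Y = -2/3) versus P(X = 0) P(Y = -2/3)
p_joint = prob                             # the event X = 0 forces Y = -2/3
p_product = prob * prob                    # P(X = 0) = P(Y = -2/3) = 1/3
```

Both variables are zero-mean and `inner_XY` is exactly zero, so $X \perp Y$ in this Hilbert space, yet `p_joint` differs from `p_product`, so $X$ and $Y$ are not independent.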


A Hilbert Space of Random Vectors 


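One can define random vectors $X$, $Y$ whose entries are random variables:
Equation:

$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}, \qquad Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}.$$

For these, the following inner product is an extension of that given above:

Equation:

$$\langle X, Y\rangle = E[Y^T X] = E\left[\sum_{i=1}^{n} X_i Y_i\right] = E\left[\mathrm{trace}(XY^T)\right].$$

The induced norm is

Equation:

$$\|X\| = \sqrt{\langle X, X\rangle} = \sqrt{E[X^T X]} = \sqrt{E\left[\sum_{i=1}^{n} X_i^2\right]},$$

the root-mean-square norm of the vector $X$.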


Minimum Mean Square Error Estimation 


In an MMSE estimation problem, we consider $Y = AX + N$, where $X$, $Y$ are two
random vectors and $N$ is usually additive white Gaussian noise ($Y$ is $m \times 1$, $A$ is
$m \times n$, $X$ is $n \times 1$, and $N \sim \mathcal{N}(0, \sigma^2 I)$ is $m \times 1$). Due to this noise model, we want
an estimate $\hat{X}$ of $X$ that minimizes $E\left[\|X - \hat{X}\|_2^2\right]$; such an estimate has highest
likelihood under an additive white Gaussian noise model. For computational simplicity,
we often want to restrict the estimator to be linear, i.e.,
Equation:

$$\hat{X} = KY = \begin{bmatrix} K_1^T \\ \vdots \\ K_n^T \end{bmatrix} Y,$$

where $K_i^T$ denotes the $i$th row of the estimation matrix $K$ and $\hat{X}_i = K_i^T Y$. We use
the definition of the $\ell_2$ norm to simplify the equation:
Equation:

$$\min_K E\left[\|X - \hat{X}\|_2^2\right] = \min_K E\left[\|X - KY\|_2^2\right] = \min_K E\left[\sum_{i=1}^{n} \left(X_i - K_i^T Y\right)^2\right].$$

Since each term in the sum depends on a different row of $K$ and all terms are nonnegative,
this minimization can be posed in terms of $n$ individual minimizations: for
$i = 1, 2, \ldots, n$, we solve

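Equation:

$$\min_{K_i} E\left[\left(X_i - K_i^T Y\right)^2\right] = \min_{K_i} E\left[\left(X_i - \sum_{j=1}^{m} K_{ij} Y_j\right)^2\right] = \min_{K_i} \left\|X_i - \sum_{j=1}^{m} K_{ij} Y_j\right\|^2,$$

where the norm is the induced norm for the Hilbert space of random variables. Note at
this point that the set of random variables $\sum_{j=1}^{m} K_{ij} Y_j$ over the choices of $K_i$ can be
written as $\mathrm{span}(\{Y_j\}_{j=1}^{m})$. Thus, the optimal $K_i$ is given by the coefficients of the
closest point in $\mathrm{span}(\{Y_j\}_{j=1}^{m})$ to the random variable $X_i$ according to the induced
norm for the Hilbert space of random variables. Therefore, we solve for $K_i$ using
results from the projection theorem with the corresponding inner product. Recall that
given a basis $\{Y_j\}$ for the subspace of interest, we obtain the equation
$\beta_i = G(K_i^T)^T = GK_i$, where $\beta_{i,j} = \langle X_i, Y_j\rangle$ and $G$ is the Gramian matrix. More
specifically, we have
Equation:

$$\underbrace{\begin{bmatrix} \langle X_i, Y_1\rangle \\ \langle X_i, Y_2\rangle \\ \vdots \\ \langle X_i, Y_m\rangle \end{bmatrix}}_{\beta_i} = \underbrace{\begin{bmatrix} \langle Y_1, Y_1\rangle & \langle Y_2, Y_1\rangle & \cdots & \langle Y_m, Y_1\rangle \\ \langle Y_1, Y_2\rangle & \langle Y_2, Y_2\rangle & \cdots & \langle Y_m, Y_2\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle Y_1, Y_m\rangle & \langle Y_2, Y_m\rangle & \cdots & \langle Y_m, Y_m\rangle \end{bmatrix}}_{G} \underbrace{\begin{bmatrix} K_{i1} \\ K_{i2} \\ \vdots \\ K_{im} \end{bmatrix}}_{K_i}.$$

Thus, one can solve for $K_i = G^{-1}\beta_i$. In the Hilbert space of random variables, we
have
Equation:

$$G = \begin{bmatrix} E[Y_1 Y_1] & E[Y_2 Y_1] & \cdots & E[Y_m Y_1] \\ E[Y_1 Y_2] & E[Y_2 Y_2] & \cdots & E[Y_m Y_2] \\ \vdots & \vdots & \ddots & \vdots \\ E[Y_1 Y_m] & E[Y_2 Y_m] & \cdots & E[Y_m Y_m] \end{bmatrix} = R_Y, \qquad \beta_i = \begin{bmatrix} E[X_i Y_1] \\ E[X_i Y_2] \\ \vdots \\ E[X_i Y_m] \end{bmatrix} = \rho_{X_i Y}.$$

Here $R_Y$ is the correlation matrix of the random vector $Y$ and $\rho_{X_i Y}$ is the cross-
correlation vector of the random variable $X_i$ and vector $Y$. Thus, we have
$K_i = G^{-1}\beta_i = R_Y^{-1}\rho_{X_i Y}$, and so $K_i^T = \rho_{X_i Y}^T R_Y^{-1}$ (since $R_Y$ is symmetric). Concatenating all the rows of $K$
together, we get $K = R_{XY} R_Y^{-1}$, where $R_{XY}$ is the cross-correlation matrix for the
random vectors $X$ and $Y$. We therefore obtain the optimal linear estimator

$$\hat{X} = KY = R_{XY} R_Y^{-1} Y.$$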
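The estimator $K = R_{XY} R_Y^{-1}$ can be sketched numerically for the model $Y = AX + N$. The example below assumes NumPy and additionally assumes that $X$ and $N$ are zero-mean and uncorrelated with each other, so that $R_{XY} = R_X A^T$ and $R_Y = A R_X A^T + \sigma^2 I$; these closed forms, the matrix `R_X`, and all numeric values are assumptions made for illustration and are not stated in the notes.

```python
import numpy as np

# Hypothetical model Y = A X + N with zero-mean X and N, uncorrelated with each other
rng = np.random.default_rng(2)
m, n, sigma = 4, 3, 0.5
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, n))
R_X = B @ B.T + np.eye(n)          # assumed correlation matrix of X (positive definite)

# Under these assumptions: R_XY = E[X Y^T] = R_X A^T and R_Y = E[Y Y^T] = A R_X A^T + sigma^2 I
R_XY = R_X @ A.T
R_Y = A @ R_X @ A.T + sigma**2 * np.eye(m)

# Optimal linear estimator K = R_XY R_Y^{-1}; R_Y is symmetric, so solve R_Y K^T = R_XY^T
K = np.linalg.solve(R_Y, R_XY.T).T

# Orthogonality of the error to the observations: E[(X - K Y) Y^T] = R_XY - K R_Y
orthogonality_gap = R_XY - K @ R_Y
```

The vanishing `orthogonality_gap` is the projection theorem condition $\langle X_i - K_i^T Y, Y_j\rangle = 0$ written for all $i$ and $j$ at once.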


At first, there may be some confusion about the difference between least squares and
minimum mean square error estimation. To summarize:

• Least squares is applied when the quantities observed are deterministic (i.e., a
"single draw" of data or observations).
• Minimum mean square error estimation is applied when random variables are
observed under noise (Gaussian in the model above); one must know a distribution over inputs, and the
error must be measured in expectation.


Infinite-Dimensional Hilbert Spaces 
Describes the extension of Hilbert spaces to infinite dimensions, including complete 
orthonormal sequences and Parseval's relation. 


While up to now bases have been linked to finite-dimensional spaces and subspaces, 
it is possible to extend the concept to infinite-dimensional spaces. 


Definition 1 Let $(X, R, +, \cdot)$ be a vector space. An infinite sequence of orthonormal
vectors $\{e_1, e_2, \ldots\} \subset X$ is said to be a complete orthonormal sequence (CONS) in
$X$ if for every $x \in X$ there exists a sequence $a_1, a_2, \ldots \in R$ such that

$$x = \sum_{i=1}^{\infty} a_i e_i.$$

For the sake of concreteness, an infinite sum is defined as
$\sum_{i=1}^{\infty} a_i e_i = \lim_{n \to \infty} \sum_{i=1}^{n} a_i e_i$. It is easy to see that $a_i = \langle x, e_i\rangle$.

Lemma 1 An orthonormal sequence is complete if and only if the only vector in $X$
orthogonal to each of the $e_i$'s is the zero vector.

Example 1 For the space $X$ of finite-energy complex-valued functions on $[0, 2\pi]$, $R = \mathbb{C}$, a
CONS is given by $e_k(t) = \frac{1}{\sqrt{2\pi}} e^{jkt}$ for $k = 0, \pm 1, \pm 2, \ldots$. These vectors are
orthonormal:
Equation:

$$\langle e_k, e_l\rangle = \int_0^{2\pi} \frac{1}{2\pi} e^{jkt} e^{-jlt}\,dt = \frac{1}{2\pi}\int_0^{2\pi} e^{j(k-l)t}\,dt = \begin{cases} 1 & k = l, \\ 0 & k \neq l. \end{cases}$$

The coefficients are given by
Equation:

$$c_k = \langle x, e_k\rangle = \frac{1}{\sqrt{2\pi}}\int_0^{2\pi} x(t) e^{-jkt}\,dt,$$

and we obtain $x = \sum_k c_k e_k$. This is the sequence behind the Fourier series
representation.
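The orthonormality of the complex exponentials and the coefficient formula can be checked numerically. The sketch below assumes NumPy and approximates the integrals over $[0, 2\pi]$ with a Riemann sum on a uniform grid (exact for these exponentials as long as $|k - l|$ is smaller than the number of grid points); the grid size `T` and the test signal are hypothetical choices.

```python
import numpy as np

# Uniform grid on [0, 2*pi) used to approximate the integrals
T = 256
t = 2 * np.pi * np.arange(T) / T
dt = 2 * np.pi / T

def e(k):
    """Basis function e_k(t) = exp(j k t) / sqrt(2 pi) sampled on the grid."""
    return np.exp(1j * k * t) / np.sqrt(2 * np.pi)

def inner(u, v):
    """Approximate <u, v> = integral of u(t) conj(v(t)) dt over [0, 2 pi]."""
    return np.sum(u * np.conj(v)) * dt

ks = range(-3, 4)
gram = np.array([[inner(e(k), e(l)) for l in ks] for k in ks])

# Hypothetical test signal: a trigonometric polynomial with known coefficients
x = 2.0 * e(1) - 1.5j * e(-2)
c = {k: inner(x, e(k)) for k in ks}
energy_from_coefficients = sum(abs(v) ** 2 for v in c.values())
```

The Gram matrix `gram` is the identity, the recovered coefficients are `c[1] = 2` and `c[-2] = -1.5j`, and the squared norm of `x` matches the sum of squared coefficient magnitudes.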


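Example 2 Let $X$ be the space of bandlimited functions $x(t)$ (i.e., the set of
functions with Fourier transform $X(f)$ such that $|X(f)| = 0$ for all $f \notin [-B, B]$).
An orthogonal sequence that spans this space is given by

Equation:

$$e_k(t) = \mathrm{sinc}\left(2B\left(t - \frac{k}{2B}\right)\right), \qquad k = 0, \pm 1, \pm 2, \ldots,$$

where $\mathrm{sinc}(t) = \sin(\pi t)/(\pi t)$. It is possible to show that the functions are
orthogonal to each other, i.e.,
Equation:

$$\langle e_k, e_l\rangle = \begin{cases} \frac{1}{2B} & k = l, \\ 0 & k \neq l, \end{cases}$$

so that the scaled functions $\sqrt{2B}\,e_k$ form a CONS. If $x$ is bandlimited, then it follows that $x(t) = \sum_k c_k e_k(t)$, with
$c_k = 2B\langle x, e_k\rangle = x(k/(2B))$. The preservation of the norm in the coefficients can
also be extended from ONBs to CONS.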


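Theorem 1 (Completeness Relation) An orthonormal sequence $e_1, e_2, \ldots$ is
complete for $X$ if and only if the completeness relation holds for all $x \in X$:
Equation:

$$\|x\|^2 = \sum_{i=1}^{\infty} |\langle x, e_i\rangle|^2.$$

The sequence $\{e_i\}$ is a CONS if and only if, for all $x \in X$,
Equation:

$$x = \sum_{i=1}^{\infty} \langle x, e_i\rangle e_i = \lim_{n \to \infty} \sum_{i=1}^{n} \langle x, e_i\rangle e_i = \lim_{n \to \infty} x_n,$$

where $x_n = \sum_{i=1}^{n} \langle x, e_i\rangle e_i$ is the partial sum. We then have
$\|x\|^2 = \|x - x_n\|^2 + \|x_n\|^2$, as these two components are orthogonal to each
other. Applying a limit on both sides for $n$, we have
Equation:

$$\|x\|^2 = \lim_{n \to \infty} \|x - x_n\|^2 + \lim_{n \to \infty} \|x_n\|^2 = 0 + \lim_{n \to \infty} \sum_{i=1}^{n} |\langle x, e_i\rangle|^2 = \sum_{i=1}^{\infty} |\langle x, e_i\rangle|^2.$$

Conversely, if the completeness relation holds, then $\lim_{n \to \infty} \|x - x_n\|^2 = \|x\|^2 - \lim_{n \to \infty} \|x_n\|^2 = 0$, so $x_n \to x$ and the sequence is complete.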


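Theorem 2 Let $X$ be a Hilbert space with a CONS $\{e_1, e_2, \ldots\}$. Then for any
$x, y \in X$, Parseval's relation holds: $\langle x, y\rangle = \sum_{i=1}^{\infty} \langle x, e_i\rangle \overline{\langle y, e_i\rangle}$.

Using the CONS, we can write the partial sums $x_n = \sum_{i=1}^{n} \langle x, e_i\rangle e_i$ and
$y_n = \sum_{i=1}^{n} \langle y, e_i\rangle e_i$. We then have
Equation:

$$\begin{aligned} |\langle x_n, y_n\rangle - \langle x, y\rangle| &= |\langle x_n, y_n\rangle - \langle x_n, y\rangle + \langle x_n, y\rangle - \langle x, y\rangle| \\ &= |\langle x_n, y_n - y\rangle + \langle x_n - x, y\rangle| \\ &\le |\langle x_n, y_n - y\rangle| + |\langle x_n - x, y\rangle| \\ &\le \|x_n\|\,\|y_n - y\| + \|x_n - x\|\,\|y\|. \end{aligned}$$

Letting $n \to \infty$ we have that the upper bound goes to zero, and therefore as $n \to \infty$
we have $\langle x_n, y_n\rangle \to \langle x, y\rangle$. Therefore,
Equation:

$$\begin{aligned} \langle x, y\rangle &= \lim_{n \to \infty} \langle x_n, y_n\rangle = \lim_{n \to \infty} \left\langle \sum_{i=1}^{n} \langle x, e_i\rangle e_i, \sum_{j=1}^{n} \langle y, e_j\rangle e_j \right\rangle \\ &= \lim_{n \to \infty} \sum_{i=1}^{n}\sum_{j=1}^{n} \langle x, e_i\rangle \overline{\langle y, e_j\rangle} \langle e_i, e_j\rangle = \lim_{n \to \infty} \sum_{i=1}^{n} \langle x, e_i\rangle \overline{\langle y, e_i\rangle} \\ &= \sum_{i=1}^{\infty} \langle x, e_i\rangle \overline{\langle y, e_i\rangle}. \end{aligned}$$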


Linear Functionals 
Introduces the definition of a linear functional in a normed space and the norm of a functional. 


Many systems in engineering can be characterized as linear systems, taking inputs in one signal
space into outputs in another. A particularly common type of system maps a signal space into its space of scalars,
i.e., it is a map $f: X \to R$. We will study these maps (known in the mathematics literature as
functionals) during the next few lectures.

Definition 1 A functional $f$ on $X$ is linear if for all $x, y \in X$, $a, b \in R$, we have that

$$f(ax + by) = af(x) + bf(y).$$

Example 1 In the case $X = \mathbb{R}^n$, all linear functionals can be written in the form
$f(x) = \sum_{i=1}^{n} a_i x_i$ for some set of scalars $\{a_1, \ldots, a_n\} \subset \mathbb{R}$.

Example 2 For a Hilbert space $X$, $f(x) = \langle x, y\rangle$ for any $y \in X$ is a linear functional.

Example 3 For the space $X = C([0, 1])$, the sampling functionals
$f(x) = x(t_0) = \langle x, \delta(t - t_0)\rangle$ are linear.

Example 4 $f(x) = \|x\|$ is not a linear functional: for instance, $f(-x) = f(x) \neq -f(x)$ whenever $x \neq 0$ (equivalently, the triangle inequality does not hold with
equality in general).


Linear functionals are particularly amenable to analysis. 


Theorem 1 If a linear functional on a normed space $X$ is continuous at a point $x_0 \in X$, then it
is continuous on the entire space $X$.

Assume $f$ is continuous at $x_0$. Let $x \in X$ be arbitrary and let $\{x_n\}$ be a sequence of points converging to $x$. Then,
Equation:

$$|f(x_n) - f(x)| = |f(x_n + x_0 - x_0) - f(x)| = |f(x_n + x_0) - f(x_0) - f(x)| = |f(x_n - x + x_0) - f(x_0)|.$$

It is easy to show that $\{x_n - x + x_0\}$ is a sequence that converges to $x_0$. Therefore,
$f(x_n - x + x_0)$ has to converge to $f(x_0)$, as $f$ is continuous at $x_0$:
Equation:

$$|f(x_n - x + x_0) - f(x_0)| \to 0 \quad \text{as } n \to \infty.$$

This implies that the sequence $\{f(x_n)\}$ converges to $f(x)$ and so $f$ is continuous at $x$. Since $x$
was arbitrary, $f$ is continuous on $X$.


Fact 1 For $f$ a linear functional, what is the value of $f(0)$? For any such $f$ and any $x \in X$,
Equation:

$$f(0) = f(0 \cdot x) = 0 \cdot f(x) = 0.$$

Definition 2 A linear functional $f$ on a normed space $X$ is bounded if there exists a constant
$M < \infty$ such that $|f(x)| \le M\|x\|$ for all $x \in X$. The smallest such constant is called the
norm of the functional, $\|f\|$:

Equation:

$$\|f\| = \inf\{M : |f(x)| \le M\|x\| \ \ \forall x \in X\}.$$

There are several ways in which we can write this norm:
Equation:

$$\|f\| = \sup_{x \in X,\, x \neq 0} \frac{|f(x)|}{\|x\|} = \sup_{x \in X,\, \|x\| = 1} |f(x)| = \sup_{x \in X,\, \|x\| \le 1} |f(x)|.$$

Before continuing, we should verify that the defined $\|f\|$ is a valid norm.

• $\|f\| = 0 \Leftrightarrow f(x) = 0$ for all $x \in X$.
• $\|f\| \ge 0$: $M$ has to be nonnegative since $|f(x)|$ and $\|x\|$ are always nonnegative.
• $\|af\| = |a|\,\|f\|$: can be trivially shown.
• $\|f_1 + f_2\| \le \|f_1\| + \|f_2\|$: by definition,
Equation:

$$\begin{aligned} \|f_1 + f_2\| &= \sup_{\|x\| \le 1} |(f_1 + f_2)(x)| = \sup_{\|x\| \le 1} |f_1(x) + f_2(x)| \le \sup_{\|x\| \le 1} \left(|f_1(x)| + |f_2(x)|\right) \\ &\le \sup_{\|x\| \le 1} |f_1(x)| + \sup_{\|x\| \le 1} |f_2(x)| = \|f_1\| + \|f_2\|. \end{aligned}$$


The following theorem can save us much work in checking for continuity of linear functionals. 
Theorem 2 A linear functional on a normed space is bounded if and only if it is continuous.

($\Rightarrow$) Assume $f$ is bounded, and let $M$ be such that $|f(x)| \le M\|x\|$ for all $x \in X$. Then for a
sequence $\{x_n\} \to 0$ we have $|f(x_n)| \le M\|x_n\| \to 0$. Therefore, $|f(x_n)| \to 0 = f(0)$, which
implies that $f$ is continuous at $x = 0$. Using Theorem [link], $f$ is continuous on $X$.

($\Leftarrow$) Assume $f$ is continuous. If we set $\epsilon = 1$, there exists $\delta > 0$ such that if $\|x - 0\| \le \delta$ then
$|f(x) - f(0)| \le 1$, i.e.,

Equation:

$$\|x\| \le \delta \ \Rightarrow\ |f(x)| \le 1.$$

Since $f$ is linear, we can write this as
Equation:

$$\|x\| \le 1 \ \Rightarrow\ |f(x)| \le \frac{1}{\delta}.$$

So, $\|f\| \le \frac{1}{\delta} < \infty$ and $f$ is bounded.


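Example 5 Let $X$ be a space of finitely nonzero sequences with $\|x\|_\infty = \max_i |x_i|$. Define a
functional $f$ as
Equation:

$$f(x) = \sum_{i=1}^{\infty} i\,x_i.$$

$f$ is linear because
Equation:

$$\begin{aligned} f(ax + by) &= \sum_{i=1}^{\infty} i\,(ax + by)_i = \sum_{i=1}^{\infty} i\,(ax_i + by_i) = \sum_{i=1}^{\infty} (a\,i\,x_i + b\,i\,y_i) \\ &= a\sum_{i=1}^{\infty} i\,x_i + b\sum_{i=1}^{\infty} i\,y_i = af(x) + bf(y). \end{aligned}$$

If we want to show that $f$ is unbounded, we must show that for every $M > 0$ there exists $x \in X$
such that $|f(x)| > M\|x\|$; it suffices to consider $M > 1$, since a bound that fails for $M$ also fails for every smaller constant. Let $x = [1, 1, \ldots, 1, 0, 0, \ldots] \in X$ with $\lceil M\rceil$ leading ones, where $\lceil M\rceil$ is the
smallest integer greater than or equal to $M$. Then $\|x\|_\infty = 1$ and
Equation:

$$f(x) = \sum_{i=1}^{\lceil M\rceil} i \cdot 1 > \sum_{i=1}^{\lceil M\rceil} 1 = \lceil M\rceil \ge M = M\|x\|_\infty.$$

Therefore, $f$ is unbounded.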
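The growth of the ratio $|f(x)| / \|x\|_\infty$ in this example can be tabulated with a few lines of code. This is a minimal sketch using only the Python standard library; the helper names `sup_norm` and `witness` and the sample values of $M$ are hypothetical.

```python
import math

def f(x):
    """f(x) = sum_i i * x_i for a finitely nonzero sequence stored as a list (x[0] is x_1)."""
    return sum(i * xi for i, xi in enumerate(x, start=1))

def sup_norm(x):
    """Sup norm of a finitely nonzero sequence stored as a list."""
    return max(abs(xi) for xi in x)

def witness(M):
    """Sequence of ceil(M) leading ones used in the example (intended for M > 1)."""
    return [1] * math.ceil(M)

# Ratio |f(x)| / ||x|| for the witness sequence built from several candidate bounds M
ratios = {M: f(witness(M)) / sup_norm(witness(M)) for M in (2, 10, 100.5)}
```

For every candidate bound `M` in the dictionary, the ratio exceeds `M`, so no finite constant can bound $f$ on this space.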


Dual Spaces 
Definition of the dual space of a normed space, which contains all bounded linear functionals on that space.


Definition 
The set of linear functionals $f$ on $X$ is itself a vector space:

• $(f + g)(x) = f(x) + g(x)$ (closed under addition),
• $(af)(x) = af(x)$ (closed under scalar multiplication),
• $z(x) = 0$ is linear and $f + z = f$ (features an additive zero).

Definition 1 The dual space (or algebraic dual) $X^*$ of $X$ is the vector space of all linear
functionals on $X$.

Example 1 If $X = \mathbb{R}^N$ then $f(x) = \sum_{i=1}^{N} a_i x_i \in X^*$ and any linear functional on $X$ can
be written in this form.

Definition 2 Let $X$ be a normed space. The normed dual $X^*$ of $X$ is the normed space of all
bounded linear functionals with norm $\|f\| = \sup_{\|x\| \le 1} |f(x)|$.

Since in the sequel we refer almost exclusively to the normed dual, we will abbreviate to
"dual" (and ignore the algebraic dual in the process).

Example 2 Let us find the dual of $X = \mathbb{R}^n$. In this space,
Equation:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad \text{and} \quad \|x\| = \sqrt{\sum_{i=1}^{n} |x_i|^2}, \quad \text{where } x_i \in \mathbb{R}.$$

From Example [link], a linear functional $f$ can be written as:
Equation:

$$f(x) = \sum_{i=1}^{n} a_i x_i = \langle x, a\rangle, \quad a \in \mathbb{R}^n.$$

There is a one-to-one mapping between the space $X$ and the dual of $X$:
Equation:

$$f \in (\mathbb{R}^n)^* \ \leftrightarrow\ a \in \mathbb{R}^n.$$

For this reason, $\mathbb{R}^n$ is called self-dual (as there is a one-to-one mapping $(\mathbb{R}^n)^* \leftrightarrow \mathbb{R}^n$).
Using the Cauchy-Schwarz inequality, we can show that
Equation:

$$|f(x)| = |\langle x, a\rangle| \le \|x\|\,\|a\|,$$

with equality if $x$ is a scalar multiple of $a$. Therefore, we have that $\|f\| = \|a\|$, i.e., the
norms of $X$ and $X^*$ match through the mapping.


Theorem 1 All normed dual spaces are Banach.

Since we have shown that all dual spaces $X^*$ have a valid norm, it remains to be shown that
all Cauchy sequences in $X^*$ are convergent.

Let $\{f_n\}_{n=1}^{\infty} \subset X^*$ be a Cauchy sequence in $X^*$. Thus, we have that for each $\epsilon > 0$ there
exists $n_0 \in \mathbb{Z}^+$ such that if $n, m > n_0$ then $\|f_n - f_m\| \le \epsilon$; that is, $\|f_n - f_m\| \to 0$ as $n, m \to \infty$. We
will first show that for an arbitrary input $x \in X$, the sequence $\{f_n(x)\}_{n=1}^{\infty} \subset R$ is Cauchy.
Pick an arbitrary $\epsilon > 0$, and obtain the $n_0 \in \mathbb{Z}^+$ from the definition of Cauchy sequence for
$\{f_n\}_{n=1}^{\infty}$. Then for all $n, m > n_0$ we have that

$$|f_n(x) - f_m(x)| = |(f_n - f_m)(x)| \le \|f_n - f_m\|\,\|x\| \le \epsilon\|x\|.$$

Now, since $R = \mathbb{R}$ and $R = \mathbb{C}$ are complete, we have that the sequence $\{f_n(x)\}_{n=1}^{\infty}$
converges to some scalar $f^*(x)$. In this way, we can define a new function $f^*$ that collects
the limits of all such sequences over all inputs $x \in X$. We conjecture that the sequence
$f_n \to f^*$: we begin by picking some $\epsilon > 0$ and the corresponding $n_0$ from the Cauchy property. Letting $m \to \infty$ in the inequality above, we have that
for $n > n_0$ and every $x \in X$, $|f_n(x) - f^*(x)| \le \epsilon\|x\|$. Writing $n_0^*$ for this $n_0$, we have for $n > n_0^*$,

Equation:

$$\|f_n - f^*\| = \sup_{x \in X,\, \|x\| = 1} |f_n(x) - f^*(x)| \le \epsilon.$$

Therefore, we have found an $n_0^*$ for each $\epsilon$ that fits the definition of convergence, and so
$f_n \to f^*$.

Next, we will show that $f^*$ is linear and bounded. To show that $f^*$ is linear, we check that
Equation:

$$f^*(ax + by) = \lim_{n \to \infty} f_n(ax + by) = a\lim_{n \to \infty} f_n(x) + b\lim_{n \to \infty} f_n(y) = af^*(x) + bf^*(y).$$

To show that $f^*$ is bounded, we have that for each $\epsilon > 0$ there exists $n_0^* \in \mathbb{Z}^+$ (shown
above) such that if $n > n_0^*$ then
Equation:

$$|f^*(x)| \le |f_n(x) - f^*(x)| + |f_n(x)| \le \epsilon\|x\| + \|f_n\|\,\|x\| = (\epsilon + \|f_n\|)\|x\|.$$

This shows that $f^*$ is linear and bounded and so $f^* \in X^*$. Therefore, $\{f_n\}_{n=1}^{\infty}$ converges
in $X^*$ and, since the Cauchy sequence was arbitrary, we have that $X^*$ is complete and
therefore Banach.


The Dual of $\ell_p$

Recall the space $\ell_p = \{(x_1, x_2, \ldots) : \sum_{i=1}^{\infty} |x_i|^p < \infty\}$, with the $\ell_p$-norm
$\|x\|_p = \left(\sum_{i=1}^{\infty} |x_i|^p\right)^{1/p}$. For $1 < p < \infty$, the dual of $\ell_p$ is $(\ell_p)^* = \ell_q$ with $q = \frac{p}{p-1}$. That is, every
bounded linear functional $f \in (\ell_p)^*$ can be represented in terms of $f(x) = \sum_{n=1}^{\infty} a_n x_n$, where
$a = (a_1, a_2, \ldots) \in \ell_q$. Note that if $p = 2$ then $q = 2$, and so $\ell_2$ is self-dual.

Bases for self-dual spaces

Consider the example self-dual space $\mathbb{R}^n$ and pick a basis $\{e_1, e_2, \ldots, e_n\}$ for it. Recall
that for each $a \in \mathbb{R}^n$ there exists a bounded linear functional $f_a \in (\mathbb{R}^n)^*$ given by
$f_a(x) = \langle x, a\rangle$. Thus, one can build a basis for the dual space $(\mathbb{R}^n)^*$ as

$$\{f_{e_1}, f_{e_2}, \ldots, f_{e_n}\}.$$


Riesz Representation Theorem 


Since it is easier to conceive a dual space by linking it to elements of a known space (as
seen above for self-duals), we may ask if there are large classes of spaces that are self-dual.

Theorem 2 (Riesz Representation Theorem) If $f \in H^*$ is a bounded linear functional on
a Hilbert space $H$, there exists a unique vector $y \in H$ such that for all $x \in H$ we have
$f(x) = \langle x, y\rangle$. Furthermore, we have $\|f\| = \|y\|$, and every $y \in H$ defines a unique
bounded linear functional $f_y(x) = \langle x, y\rangle$.

Thus, since there is a one-to-one mapping between the dual of the Hilbert space and the
Hilbert space ($H^* \leftrightarrow H$), we say that all Hilbert spaces are self-dual.

Pick a bounded linear
functional $f \in H^*$ and let $\mathcal{N} \subseteq H$ be the set of all vectors $x \in H$ for which $f(x) = 0$.
Note that $\mathcal{N}$ is a closed subspace of $H$, for if $\{x_n\} \subseteq \mathcal{N}$ is a sequence that converges to
$x \in H$ then, due to the continuity of $f$, $f(x_n) \to f(x) = 0$ and so $x \in \mathcal{N}$. We consider
two possibilities for the subspace $\mathcal{N}$:

• If $\mathcal{N} = H$ then $y = 0$ and $f(x) = \langle x, 0\rangle = 0$.
• If $\mathcal{N} \neq H$ then we can partition $H = \mathcal{N} \oplus \mathcal{N}^\perp$ and so there must exist some
$z_0 \in \mathcal{N}^\perp$, $z_0 \neq 0$. Then, define $z = \frac{z_0}{f(z_0)}$ to get $z \in \mathcal{N}^\perp$ such that $f(z) = 1$.

Now, pick an arbitrary $x \in H$ and see that
Equation:

$$f(x - f(x)z) = f(x) - f(x)f(z) = f(x) - f(x) = 0.$$

Therefore, we have that $x - f(x)z \in \mathcal{N}$. Since $z \perp \mathcal{N}$, we have that

Equation:

$$\begin{aligned} \langle x - f(x)z, z\rangle &= 0, \\ \langle x, z\rangle - f(x)\langle z, z\rangle &= 0, \end{aligned}$$

and so
Equation:

$$f(x) = \frac{\langle x, z\rangle}{\langle z, z\rangle} = \frac{\langle x, z\rangle}{\|z\|^2} = \left\langle x, \frac{z}{\|z\|^2}\right\rangle.$$

Thus, we have found a vector $y = \frac{z}{\|z\|^2}$ such that $f(x) = \langle x, y\rangle$ for all $x \in H$. Now,
to compute the norm of the functional, consider
Equation:

$$|f(x)| = |\langle x, y\rangle| \le \|x\| \cdot \|y\|,$$

with equality if $x = \alpha y$ for $\alpha \in R$. Thus, we have that $\|f\| = \|y\|$.

We now show uniqueness: let us assume that there exists a second $\hat{y} \neq y$, $\hat{y} \in H$, for which
$f(x) = \langle x, \hat{y}\rangle$ for all $x \in H$. Then we have that for all $x \in H$,

Equation:

$$\begin{aligned} \langle x, y\rangle &= \langle x, \hat{y}\rangle, \\ \langle x, y\rangle - \langle x, \hat{y}\rangle &= 0, \\ \langle x, y - \hat{y}\rangle &= 0, \\ y - \hat{y} &= 0, \\ y &= \hat{y}. \end{aligned}$$

This contradicts the original assumption that $\hat{y} \neq y$. Therefore we have that $y$ is unique,
completing the proof.
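In $\mathbb{R}^n$ the representing vector can be recovered directly by evaluating the functional on an orthonormal basis. The sketch below assumes NumPy and a hypothetical functional `f` on $\mathbb{R}^3$; it uses this finite-dimensional shortcut rather than the null-space construction in the proof above, and the helper name `riesz_vector` is an assumption made for the example.

```python
import numpy as np

def riesz_vector(f, n):
    """Return y in R^n with f(x) = <x, y>, assuming f is a linear functional on R^n."""
    basis = np.eye(n)
    # For a real inner product, y_i = f(e_i) on the standard orthonormal basis
    return np.array([f(basis[i]) for i in range(n)])

# Hypothetical bounded linear functional on R^3
def f(x):
    return 2.0 * x[0] - x[1] + 0.5 * x[2]

y = riesz_vector(f, 3)

# Compare ||f||, estimated by sampling the unit sphere, with ||y||
rng = np.random.default_rng(3)
samples = rng.standard_normal((2000, 3))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
sampled_sup = max(abs(f(s)) for s in samples)
attained = abs(f(y / np.linalg.norm(y)))   # x = y / ||y|| attains the supremum
```

The sampled supremum never exceeds $\|y\|$, and the unit vector $y / \|y\|$ attains it, which illustrates $\|f\| = \|y\|$.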


Linear Functional Extensions 


When we consider spaces nested inside one another, we can define pairs of functions that 
match each other at the overlap. 


Definition 3 Let $f$ be a linear functional on a subspace $M \subseteq X$ of a vector space $X$. A
linear functional $F$ is said to be an extension of $f$ to $N$ (where $N$ is another subspace of $X$
that satisfies $M \subseteq N \subseteq X$) if $F$ is defined on $N$ and $F$ is identical to $f$ on $M$.

Here is another reason why we are so interested in bounded linear functionals.

Theorem 3 (Hahn-Banach Theorem) A bounded linear functional $f$ on a subspace
$M \subseteq X$ can be extended to a bounded linear functional $F$ on the entire space $X$ with
Equation:

$$\|F\| = \sup_{x \in X,\, x \neq 0} \frac{|F(x)|}{\|x\|} = \|f\| = \sup_{x \in M,\, x \neq 0} \frac{|f(x)|}{\|x\|}.$$

The proof is on page 111 of Luenberger.


Linear Operators 
Introduces linear operators, along with properties and classifications. 


We begin our treatment of linear operators, also known as transformations.
They can be thought of as extensions of functionals that map into arbitrary
vector spaces rather than a scalar space.

Definition 1 Let $X$ and $Y$ be linear vector spaces and $D \subseteq X$. A rule
$A: X \to Y$ that associates every element $x \in D$ with an element
$y = A(x) \in Y$ is said to be a transformation from $X$ to $Y$ with domain $D$.

We have defined $D$ because the transformation may only be defined for
some subset of $X$.

Definition 2 A transformation $A: X \to Y$, where $X$ and $Y$ are vector spaces
over a scalar set $R$, is said to be linear if for every $x_1, x_2 \in X$ and all
scalars $\alpha_1, \alpha_2 \in R$,

Equation:

$$A(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 A(x_1) + \alpha_2 A(x_2).$$

A common type of linear transformation is the transformation
$A: \mathbb{R}^n \to \mathbb{R}^m$. In this case, $A$ is an $m \times n$ matrix with real-valued entries
(i.e., $A \in \mathbb{R}^{m \times n}$). There are a variety of linear transformations that arise in
practice, producing equations of the form $A(x) = y$, with $x \in X$ and
$y \in Y$, where $X$ and $Y$ are linear vector spaces. For example, the equation
Equation: 


may be written in operator notation as A(x(t)) = y(t), where 
A: C[T] — C[T] is the operator 
Equation: 


Often, we will simply write $y = Ax$.


Definition 3 Let $(X, \|\cdot\|_X)$, $(Y, \|\cdot\|_Y)$ be normed vector spaces. A
linear operator $A: X \to Y$ is bounded if there exists a constant $M < \infty$
such that $\|Ax\|_Y \le M\|x\|_X$ for all $x \in X$. The smallest $M$ that
satisfies this condition is the norm of $A$:

Equation:

$$\|A\|_{X \to Y} = \sup_{x \in X,\, x \neq 0} \frac{\|Ax\|_Y}{\|x\|_X} = \sup_{x \in X,\, \|x\|_X = 1} \|Ax\|_Y = \sup_{x \in X,\, \|x\|_X \le 1} \|Ax\|_Y.$$

Geometrically, the operator norm $\|A\|$ measures the maximum extent to which $A$
stretches the unit sphere. Thus, $\|A\|$ bounds the amplifying power of the
operator $A$.
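For a matrix operator with Euclidean norms on both spaces, the operator norm equals the largest singular value. The sketch below assumes NumPy and a hypothetical $2 \times 3$ matrix; it compares that value with a sampled estimate of the supremum over the unit sphere.

```python
import numpy as np

# Hypothetical operator A : R^3 -> R^2 given by a matrix
A = np.array([[3.0, 0.0, 1.0],
              [1.0, 2.0, 0.0]])

# With Euclidean norms on both spaces, ||A|| equals the largest singular value
op_norm = np.linalg.norm(A, 2)

# Sampling the unit sphere gives a lower bound on the supremum
rng = np.random.default_rng(4)
xs = rng.standard_normal((5000, 3))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
sampled = np.max(np.linalg.norm(xs @ A.T, axis=1))

# The top right singular vector attains the supremum
_, s, Vt = np.linalg.svd(A)
attained = np.linalg.norm(A @ Vt[0])
```

Sampling can only approach `op_norm` from below, whereas the first right singular vector `Vt[0]` reaches it exactly (up to rounding).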


Operators possess many properties that are shared with functionals, with 
similar proofs. 


Definition 4 A linear operator $A: X \to Y$ is continuous on $X$ if it is
continuous at every point $x \in X$.

Theorem 1 A linear operator is bounded if and only if it is continuous.

Definition 5 The sum of two linear operators $A_1: X \to Y$ and
$A_2: X \to Y$ is defined as $(A_1 + A_2)x = A_1 x + A_2 x$. Similarly, the
scaling of a linear operator $A: X \to Y$ is defined as $(cA)x = c(Ax)$. Both
resulting operators are linear as well.

We can also extend the definition of the dual space to operators.

Definition 6 The normed space of all bounded linear operators from
$X$ to $Y$ is denoted $B(X, Y)$.

Are these spaces complete?

Theorem 2 Let $X$, $Y$ be normed linear spaces. If $Y$ is a complete space
then $B(X, Y)$ is complete.

Much of the terminology for operators is drawn from matrices.

Definition 7 Let $X$ be a linear vector space. The operator $I: X \to X$ given
by $I(x) = x$ for all $x \in X$ is known as the identity operator, and
$I \in B(X, X)$.

Definition 8 Let $A_1: X \to Y$ and $A_2: Y \to Z$ be linear operators. The
composition of these two operators $(A_2 A_1)x = A_2(A_1 x)$ is called a
product operator.

Definition 9 An operator $A: X \to Y$ is injective (or one-to-one) if for each
$y \in Y$ there exists at most one $x \in X$ such that $y = Ax$. In other words, if
$Ax_1 = Ax_2$ then $x_1 = x_2$.

Definition 10 An operator $A: X \to Y$ is surjective (or onto) if for all
$y \in Y$ there exists an $x \in X$ such that $y = Ax$.

Definition 11 An operator $A: X \to Y$ is bijective if it is injective and
surjective.

Lemma 1 If $A_1: X \to Y$ is a bijective operator, then there exists a
transformation $A_2: Y \to X$ such that $A_2(A_1 x) = x$ for all $x \in X$.

Note that the lemma above implies $A_2 A_1 = I$. Thus, we say that $A_1$ is
invertible with inverse $A_1^{-1} = A_2$.

Definition 12 An operator $A: X \to X$ is non-singular if it has an inverse
in $B(X, X)$; otherwise $A$ is singular.

In other words, if a transformation $A: X \to X$ is non-singular there exists
a transformation $A^{-1}$ such that $AA^{-1} = I$. This extends the concept of
singularity from matrices to arbitrary operators.


Properties of Linear Operators 
Discusses basic properties and classifications of operators. 


We start with a simple but useful property. 


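Lemma 1 If $A \in B(X, X)$ and $\langle x, Ax\rangle = 0$ for all $x \in X$, then $A = 0$.

Let $a \in \mathbb{C}$; then for any $x, y \in X$, we have that $x + ay \in X$. Therefore, we obtain
Equation:

$$\begin{aligned} 0 &= \langle x + ay, A(x + ay)\rangle, \\ &= \langle x + ay, Ax + aAy\rangle, \\ &= \langle x, Ax\rangle + \overline{a}\langle x, Ay\rangle + a\langle y, Ax\rangle + |a|^2\langle y, Ay\rangle. \end{aligned}$$

Since $x, y \in X$, we have that $\langle x, Ax\rangle = \langle y, Ay\rangle = 0$; therefore,

$$0 = \overline{a}\langle x, Ay\rangle + a\langle y, Ax\rangle.$$

If we set $a = 1$, then $0 = \langle x, Ay\rangle + \langle y, Ax\rangle$. So in this case, $\langle x, Ay\rangle = -\langle y, Ax\rangle$.
If we set $a = i$, then $0 = -i\langle x, Ay\rangle + i\langle y, Ax\rangle$. So in this case, $\langle x, Ay\rangle = \langle y, Ax\rangle$.

Thus, $\langle x, Ay\rangle = 0$ for all $x, y \in X$, which means $Ay = 0$ for all $y \in X$ (take $x = Ay$). So we can come
to the conclusion that $A = 0$.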
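The choice $a = i$ in the proof relies on complex scalars. The following sketch, assuming NumPy, illustrates why: a rotation by 90 degrees in $\mathbb{R}^2$ is a nonzero operator that satisfies $\langle x, Ax\rangle = 0$ for every real $x$, while the same matrix acting on $\mathbb{C}^2$ does not satisfy the hypothesis of the lemma. The specific vectors are hypothetical test inputs.

```python
import numpy as np

# Rotation by 90 degrees: a nonzero operator with <x, Ax> = 0 for every real x
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

rng = np.random.default_rng(5)
real_vals = [x @ (A @ x) for x in rng.standard_normal((100, 2))]

# Over the complex numbers the same matrix fails the hypothesis of the lemma
z = np.array([1.0, 1.0j])
complex_val = np.vdot(A @ z, z)   # <z, Az> = (Az)^H z
```

All entries of `real_vals` vanish, whereas `complex_val` equals $2i$, so the lemma does not apply to the real rotation and is not contradicted by it.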


Solutions to Operator Equations 


Assume X and Y are two normed linear spaces and A € B(.X,Y) is a bounded linear 
operator. Now pick y € Y. Then we pose the question: does a solution  € X to the 
equation Ax = y exist? 


There are three possibilities: 


1. A unique solution exists; 
2. multiple solutions exist; or 
3. no solution exists. 


We consider these cases separately below. 


1. Unique solution: Assume z and «x, are two solutions to the equation. In this case we 
have Az = Ax,. So A(x — x) = 0. Therefore, x — 2 € VW (A). If the solution x 
is unique then we must have = x, and x — x; = 0. Therefore, .V(A) = {0}. 
Since the operator has a trivial null space, then A~! exists. Thus, the solution to the 
equation is given by & = A ly; 


2. Multiple solutions: In this case, we may prefer to pick a particular solution. Often, 
our goal is to find the solution with smallest norm (for example, to reduce power in a 
communication problem). Additionally, there is a closed-form expression for the 
minimum-norm solution to the equation Ax = y. Theorem 1 Let X, Y be Hilbert 
spaces, y € Y, and A € B(X,Y). Then Z is the solution to Az = y if and only if 
¢@ = A°z, where z € Y is the solution of AA’z = y. Let 2; bea solution to Ax = y. 
Then all other solutions can be written as = xz — u, where u € .V(A). We 
therefore search for the solution that achieves the minimum value of ||z1, — u|| over 
u € W(A), ie., the closest point to x; in-4(A). Assume @ is such a point; then, 

a, —% L W (A). Now, recall that (4 (A))* =B@ (A’), sor=27—-Uuck (A*) 
. Thus @ = A’z for some z € Y, and y = Az = AA ‘z. Note that if AA’ is 
invertible, then @ = A°z = A*(AA’) “ty. 

3. No solution: In this case, we may aim to find a solution that minimizes the mismatch
between the two sides of the equation, i.e., $\|y - Ax\|$. This is the well-known
projection problem of $y$ onto $\mathcal{R}(A)$.

Theorem 2 Let $X$, $Y$ be Hilbert spaces, $y \in Y$, and $A \in B(X,Y)$. The vector $\hat{x}$
minimizes $\|y - Ax\|$ if and only if $A^*A\hat{x} = A^*y$.

Denote $u = Ax$ so that $u \in \mathcal{R}(A)$. We need to find the minimum value of
$\|y - u\|$ over $u \in \mathcal{R}(A)$. Assume $\hat{u}$ is the closest point to $y$ in
$\mathcal{R}(A)$; then $y - \hat{u} \perp \mathcal{R}(A)$, which means
$y - \hat{u} \in (\mathcal{R}(A))^\perp$. Recall that $(\mathcal{R}(A))^\perp = \mathcal{N}(A^*)$,
which implies that $y - \hat{u} \in \mathcal{N}(A^*)$. So $A^*(y - \hat{u}) = 0$. Thus
$A^*y = A^*\hat{u}$. Now, denoting $\hat{u} = A\hat{x}$, we have that $A^*y = A^*A\hat{x}$. Note
that if $A^*A$ is invertible, then we have

$$\hat{x} = (A^*A)^{-1}A^*y.$$
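The two closed-form expressions above can be checked numerically in the finite-dimensional case, where the adjoint of a real matrix is its transpose. The following is a minimal sketch with NumPy; the random matrices `A` and `B` are illustrative assumptions (full row rank and full column rank, respectively), not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined system (multiple solutions): A is 2 x 4 with full row rank.
A = rng.standard_normal((2, 4))
y = rng.standard_normal(2)
x_min = A.T @ np.linalg.solve(A @ A.T, y)      # x = A^* (A A^*)^{-1} y

# Adding a null-space vector u gives another solution with a larger norm.
u = (np.eye(4) - np.linalg.pinv(A) @ A) @ rng.standard_normal(4)
x_other = x_min + u

# Overdetermined system (no exact solution): B is 5 x 3 with full column rank.
B = rng.standard_normal((5, 3))
w = rng.standard_normal(5)
x_ls = np.linalg.solve(B.T @ B, B.T @ w)       # x = (B^* B)^{-1} B^* w
residual = w - B @ x_ls                        # orthogonal to the range of B
```

In both cases the optimality condition is an orthogonality statement: `x_min` is orthogonal to the null space of `A`, and `residual` is orthogonal to the range of `B`.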


Unitary Operator 
Definition 1 An operator $A \in B(X,X)$ is said to be unitary if $AA^* = A^*A = I$.
This implies that $A^* = A^{-1}$. Unitary operators have norm-preservation properties.


Theorem 3 $A \in B(X,X)$ is unitary if and only if $\mathcal{R}(A) = X$ and $\|Ax\| = \|x\|$ for all
$x \in X$.


Let $A$ be unitary, then for any $x \in X$,
Equation:

$$\|x\|^2 = \langle x, x\rangle = \langle x, A^*Ax\rangle = \langle Ax, Ax\rangle = \|Ax\|^2.$$


Since $AA^* = I$, every $x \in X$ can be written as $x = A(A^*x)$, so $X = \mathcal{R}(A)$. So if
$A \in B(X,X)$ is unitary, then $\mathcal{R}(A) = X$ and $\|Ax\| = \|x\|$ for all $x \in X$.
Conversely, assume that $\mathcal{R}(A) = X$ and $\|Ax\| = \|x\|$ for all $x \in X$. Then
Equation:

$$0 = \|x\|^2 - \|Ax\|^2 = \langle x, x\rangle - \langle Ax, Ax\rangle = \langle x, x\rangle - \langle x, A^*Ax\rangle = \langle x, x - A^*Ax\rangle.$$

Since this is true for all $x \in X$ and $I - A^*A$ is self-adjoint, we have that
$x - A^*Ax = 0$ for all $x \in X$, which means that $x = A^*Ax$ for all $x \in X$. Therefore, we
must have $A^*A = I$. Additionally, since the operator preserves norms, we find that for any
$x, y \in X$ we have that

$\|Ax - Ay\| = \|A(x - y)\| = \|x - y\|$. So $x = y$ if and only if $Ax = Ay$, implying
that $A$ is one-to-one. Since $\mathcal{R}(A) = X$, then $A$ is onto as well. Thus, $A$ is
invertible, which means $A^{-1}A = I$. So $A^{-1}A = I = A^*A$, and therefore $A^{-1} = A^*$.
The result is that $A$ is unitary. We have shown that if $\mathcal{R}(A) = X$ and
$\|Ax\| = \|x\|$ for all $x \in X$, then $A \in B(X,X)$ is unitary.


Corollary 4 If $X$ is finite-dimensional, then $A$ is unitary if and only if $\|Ax\| = \|x\|$ for
all $x \in X$.
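As an illustration of the norm-preservation property, the sketch below builds the normalized DFT matrix (a standard example of a unitary operator on $\mathbb{C}^n$, chosen here as an assumption for the example) and checks both $F^*F = FF^* = I$ and $\|Fx\| = \|x\|$.

```python
import numpy as np

n = 8
# Normalized DFT matrix: F[k, l] = exp(-2j * pi * k * l / n) / sqrt(n)
k = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

rng = np.random.default_rng(1)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Unitarity: F^* F = F F^* = I
is_unitary = (np.allclose(F.conj().T @ F, np.eye(n))
              and np.allclose(F @ F.conj().T, np.eye(n)))

# Norm preservation: ||F x|| = ||x||
norm_in = np.linalg.norm(x)
norm_out = np.linalg.norm(F @ x)
```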


Fundamental Subspaces of an Operator 
Discusses the two fundamental subspaces of a linear operator - range and 
nullspace - and the rank of an operator. 


Definition 1 Let $X$ be a finite-dimensional Hilbert space with orthonormal
basis $\{b_1, b_2, \ldots, b_n\}$ and $A : X \to Y$ be a linear operator. The range of $A$
is the subspace

Equation:

$$\mathcal{R}(A) = \{y \in Y : y = Ax, \ x \in X\} = [\{Ab_1, Ab_2, \ldots, Ab_n\}].$$

It is easy to see that $\mathcal{R}(A)$ is a subspace of $Y$. To show the second equality
above, note that if $x \in X$ then it can be written as $x = \sum_{i=1}^n a_i b_i$;
therefore,

Equation:

$$Ax = A\left(\sum_{i=1}^n a_i b_i\right) = \sum_{i=1}^n A(a_i b_i) = \sum_{i=1}^n a_i A b_i = \sum_{i=1}^n a_i \psi_i,$$

where $\psi_i = Ab_i$. Before we can claim that $\{\psi_i\}$ is a basis for $\mathcal{R}(A)$, we
must first show its elements are linearly independent.


Lemma 1 If $X$ is $n$-dimensional and $A : X \to Y$, then $\mathcal{R}(A)$ has
dimension less than or equal to $n$.


If the set $\{\psi_1, \psi_2, \ldots, \psi_n\}$ given above is linearly independent then it is a
basis for $\mathcal{R}(A)$, and the dimension of $\mathcal{R}(A)$ is $n$. If they are not linearly
independent we must show that $\dim(\mathcal{R}(A)) < n$. If $\{\psi_1, \psi_2, \ldots, \psi_n\}$ are
linearly dependent then there exists a set of scalars $\{a_1, a_2, \ldots, a_n\}$ such
that $\sum_{i=1}^n a_i\psi_i = 0$ with at least one nonzero $a_i$; we let that be $a_1$ without
loss of generality. We then have

Equation:

$$\psi_1 = -\frac{1}{a_1}\sum_{i=2}^n a_i\psi_i,$$

and so $\mathrm{span}(\{\psi_1, \ldots, \psi_n\}) = \mathrm{span}(\{\psi_2, \ldots, \psi_n\})$. If the set
$\{\psi_2, \ldots, \psi_n\}$ is linearly independent, then $\dim(\mathcal{R}(A)) = n - 1 < n$.
Otherwise, iterate this procedure to show that $\dim(\mathcal{R}(A)) < n - 1$.


Definition 2 The null space of $A$ is the set of all points $x \in X$ that map to
zero:

Equation:

$$\mathcal{N}(A) = \{x \in X : Ax = 0\}.$$

It is easy to see that $\mathcal{N}(A)$ is a subspace of $X$.

Lemma 2 An operator $A$ is non-singular if and only if $\mathcal{N}(A) = \{0\}$.

We can extend the concept of rank from matrices to operators.

Definition 3 The rank of $A$ is the dimension of $\mathcal{R}(A)$: $\mathrm{rank}(A) = \dim(\mathcal{R}(A))$.


Theorem 1 Let $A : X \to Y$ and $X$ be an $n$-dimensional space. Then,
Equation:

$$\mathrm{rank}(A) + \dim(\mathcal{N}(A)) = n.$$

Let $\mathcal{N}(A)$ have dimension $m$. Design an orthonormal basis $e_1, \ldots, e_n$ for $X$
such that $e_1, \ldots, e_m$ are an orthonormal basis for $\mathcal{N}(A) \subseteq X$. This
implies that $\psi_i = Ae_i = 0$ for $i = 1, \ldots, m$. Therefore,

$$\mathcal{R}(A) = \mathrm{span}(\{\psi_i\}_{i=1}^n) = \mathrm{span}(\{\psi_i\}_{i=m+1}^n),$$

so $\mathrm{rank}(A) = \dim(\mathcal{R}(A)) \le n - m$.

Now we need to show that $\{\psi_i\}_{i=m+1}^n$ are linearly independent. We use
contradiction by assuming that $\{\psi_i\}_{i=m+1}^n$ are linearly dependent, which
means that $\sum_{i=m+1}^n b_i\psi_i = 0$ for some scalars $b_i$ that are not all zero, i.e.,
Equation:

$$\sum_{i=m+1}^n b_i\psi_i = \sum_{i=m+1}^n b_i Ae_i = A\left(\sum_{i=m+1}^n b_i e_i\right) = 0,$$

so we know $\sum_{i=m+1}^n b_i e_i \in \mathcal{N}(A)$. We also know that $e_i \perp e_j$ for
$i \ne j$, and so $e_i \perp \mathcal{N}(A)$ for $i = m+1, \ldots, n$. This implies in turn that
$\sum_{i=m+1}^n b_i e_i \perp \mathcal{N}(A)$ for all choices of $\{b_i\}$. Because
$\sum_{i=m+1}^n b_i e_i \in \mathcal{N}(A)$, the only possibility is that
$\sum_{i=m+1}^n b_i e_i = 0$. However, the orthonormal basis vectors $e_i$ for
$i = m+1, \ldots, n$ are linearly independent, and so we must have that $b_i = 0$ for all $i$.
This is a contradiction with our original assumption, implying that the vectors
$\{\psi_i\}_{i=m+1}^n$ are linearly independent. Therefore, this set of vectors is a basis
for $\mathcal{R}(A)$ and $\mathrm{rank}(A) = \dim(\mathcal{R}(A)) = n - m$. This in turn implies that

$$\mathrm{rank}(A) + \dim(\mathcal{N}(A)) = n.$$
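For a matrix operator the theorem can be verified numerically. The sketch below uses a hypothetical $4 \times 6$ matrix of rank 3 and reads both the rank and an orthonormal basis of the null space off the singular value decomposition; the tolerance used to decide which singular values count as zero is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(2)
# A 4 x 6 matrix of rank 3, built as a product of 4 x 3 and 3 x 6 factors.
A = rng.standard_normal((4, 3)) @ rng.standard_normal((3, 6))
n = A.shape[1]                      # dimension of the domain X = R^6

# Right singular vectors with (numerically) zero singular value span N(A).
_, s, Vh = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * s[0]
rank = int(np.sum(s > tol))
null_basis = Vh[rank:].T            # n x (n - rank), orthonormal columns
nullity = null_basis.shape[1]
```

Here `rank + nullity` equals `n`, and every column of `null_basis` is mapped to (numerically) zero by `A`.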


Adjoint Operators 
Definition and properties of an adjoint operator between normed spaces 


Adjoint operators allow us to translate inner products in the destination space to inner products in the source 
space. 


Definition 1 Let $X$ and $Y$ be inner product spaces and $A \in B(X,Y)$. The adjoint operator $A^* : Y \to X$ is
defined by the equation $\langle Ax, y\rangle_Y = \langle x, A^*y\rangle_X$ for all $x \in X$, $y \in Y$.


The subindices in the inner product notation clarify the inner product space we refer to, but it is often implicit 
from the inputs. 


Example 1 Pick $X = \mathbb{R}^n$, $Y = \mathbb{R}^m$, and define the operator $A \in B(X,Y)$ by an $m \times n$ matrix; we want to find
its adjoint $A^*$. We appeal to the definition:

$$\langle Ax, y\rangle_Y = y^T(Ax) = y^TAx = (A^Ty)^Tx = \langle x, A^Ty\rangle_X,$$

so according to the definition, we have that $A^*y = A^Ty$, resulting in $A^* = A^T$.

Theorem 1 If $A \in B(X,Y)$ then the adjoint operator $A^* \in B(Y,X)$ and $\|A^*\| = \|A\|$.


To see that $A^*$ is linear, note that
Equation:

$$\langle x, A^*(a_1y_1 + a_2y_2)\rangle = \langle Ax, a_1y_1 + a_2y_2\rangle = \bar{a}_1\langle Ax, y_1\rangle + \bar{a}_2\langle Ax, y_2\rangle = \bar{a}_1\langle x, A^*y_1\rangle + \bar{a}_2\langle x, A^*y_2\rangle = \langle x, a_1A^*y_1 + a_2A^*y_2\rangle,$$

so $A^*(a_1y_1 + a_2y_2) = a_1A^*y_1 + a_2A^*y_2$ (for real scalars the conjugates can be dropped). We will next show that $\|A^*\| \le \|A\|$ and $\|A\| \le \|A^*\|$, which
implies that $\|A^*\| = \|A\|$. First, note that for any nonzero $y \in Y$,
Equation:

$$\frac{\|A^*y\|^2}{\|y\|^2} = \frac{\langle A^*y, A^*y\rangle}{\|y\|^2} = \frac{\langle AA^*y, y\rangle}{\|y\|^2} \le \frac{\|AA^*y\|\,\|y\|}{\|y\|^2} \le \frac{\|A\|\,\|A^*y\|}{\|y\|} \le \|A\|\,\|A^*\|;$$

since this is true for all nonzero $y \in Y$, we have $\|A^*\|^2 \le \|A\|\,\|A^*\|$, and so $\|A^*\| \le \|A\|$.

Next, pick $x_0 \notin \mathcal{N}(A)$ and set $y_0 = \frac{Ax_0}{\|Ax_0\|}$. Note that if $\mathcal{N}(A) = X$, then $Ax = 0$ for all $x$ and the adjoint
$A^*y = 0$ for all $y$ satisfies the theorem. For such a choice of $x_0$ and $y_0$, we note that
Equation:

$$\|Ax_0\|^2 = \langle Ax_0, Ax_0\rangle = \langle Ax_0, \|Ax_0\|\,y_0\rangle = \|Ax_0\|\,\langle Ax_0, y_0\rangle = \|Ax_0\|\,\langle x_0, A^*y_0\rangle$$
$$\le \|Ax_0\|\,\|x_0\|\,\|A^*y_0\| \le \|Ax_0\|\,\|x_0\|\,\|A^*\|\,\|y_0\| = \|Ax_0\|\,\|x_0\|\,\|A^*\|,$$

so $\|Ax_0\| \le \|x_0\|\,\|A^*\|$, and $\frac{\|Ax_0\|}{\|x_0\|} \le \|A^*\|$; thus, $\|A\| \le \|A^*\|$. Therefore, $\|A\| = \|A^*\|$.
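Both the defining identity of the adjoint and the norm equality can be checked numerically for a real matrix operator, whose adjoint is the transpose (as in Example 1). This is a minimal sketch with a hypothetical random matrix; `np.linalg.norm(A, 2)` returns the induced operator norm (the largest singular value).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 5
A = rng.standard_normal((m, n))      # operator from R^n to R^m
A_adj = A.T                          # adjoint for the standard inner products

x = rng.standard_normal(n)
y = rng.standard_normal(m)
lhs = np.dot(A @ x, y)               # <Ax, y>_Y
rhs = np.dot(x, A_adj @ y)           # <x, A^* y>_X

norm_A = np.linalg.norm(A, 2)        # operator norm of A
norm_A_adj = np.linalg.norm(A_adj, 2)  # operator norm of A^*
```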


Fact 1 Some quick facts on adjoint operators: 


1. $(A^*)^* = A$.

2. $(A_1 + A_2)^* = A_1^* + A_2^*$.

3. $(\alpha A)^* = \bar{\alpha}A^*$.

4. $(A_1A_2)^* = A_2^*A_1^*$.

5. If $A^{-1}$ exists, then $(A^{-1})^* = (A^*)^{-1}$.


Definition 2 An operator $A \in B(X,X)$ is said to be self-adjoint if $A^* = A$.


Definition 3 Let $X$ be a Hilbert space. A self-adjoint linear operator $A \in B(X,X)$ is said to be positive
semidefinite if $\langle x, Ax\rangle \ge 0$ for all $x \in X$.


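Example 2 Let $X = L_2[0,1]$ and define an operator $A \in B(X,X)$ by
Equation:

$$y(t) = (Ax)(t) = \int_0^1 K(t,s)\,x(s)\,ds, \quad t \in [0,1],$$

where $\int_0^1\int_0^1 |K(t,s)|^2\,ds\,dt < \infty$. What is its adjoint $A^*$?


To find the adjoint, we use its definition $\langle Ax, w\rangle = \langle x, A^*w\rangle$:
Equation:

$$\langle Ax, w\rangle = \int_0^1 (Ax)(t)\,w(t)\,dt = \int_0^1\int_0^1 K(t,s)\,x(s)\,ds\,w(t)\,dt = \int_0^1\int_0^1 K(t,s)\,x(s)\,w(t)\,dt\,ds$$
$$= \int_0^1 x(s)\int_0^1 K(t,s)\,w(t)\,dt\,ds = \langle x, v\rangle,$$

where $v(s) = (A^*w)(s) = \int_0^1 K(t,s)\,w(t)\,dt$; changing variables, $(A^*w)(t) = \int_0^1 K(s,t)\,w(s)\,ds$.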


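Example 3 Let $X = L_2[0,1]$ and define an operator $A \in B(X,X)$ by
Equation:

$$y(t) = (Ax)(t) = \int_0^t K(t,s)\,x(s)\,ds.$$

What is its adjoint $A^*$? Once again, from the definition,
Equation:

$$\langle Ax, w\rangle = \int_0^1 (Ax)(t)\,w(t)\,dt = \int_0^1\int_0^t K(t,s)\,x(s)\,ds\,w(t)\,dt = \int_0^1\int_0^t K(t,s)\,x(s)\,w(t)\,ds\,dt$$
$$= \int_0^1\int_s^1 K(t,s)\,x(s)\,w(t)\,dt\,ds = \int_0^1 x(s)\int_s^1 K(t,s)\,w(t)\,dt\,ds.$$

We have that $v(s) = (A^*w)(s) = \int_s^1 K(t,s)\,w(t)\,dt$; changing variables, we obtain the adjoint

$$(A^*w)(t) = \int_t^1 K(s,t)\,w(s)\,ds.$$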
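A discretized version makes the change in the integration limits visible: on a uniform grid, the operator with upper limit $t$ becomes a lower-triangular matrix, and its adjoint becomes the upper-triangular transpose. The sketch below is an illustration only; the kernel `kernel`, the grid size `N`, and the test signals are hypothetical choices, and inner products are approximated by Riemann sums.

```python
import numpy as np

N = 200
dt = 1.0 / N
t = (np.arange(N) + 0.5) * dt                    # grid midpoints on [0, 1]

def kernel(tt, ss):
    """Hypothetical kernel K(t, s), used only for illustration."""
    return np.exp(-(tt - ss))

Kmat = kernel(t[:, None], t[None, :])            # Kmat[i, j] = K(t_i, s_j)

# (A x)(t_i)   ~ sum over s_j <= t_i of K(t_i, s_j) x(s_j) dt
A = np.tril(Kmat) * dt
# (A^* w)(t_i) ~ sum over s_j >= t_i of K(s_j, t_i) w(s_j) dt
A_adj = np.triu(Kmat.T) * dt

x = np.sin(2 * np.pi * t)
w = t ** 2
lhs = np.sum((A @ x) * w) * dt                   # <Ax, w>
rhs = np.sum(x * (A_adj @ w)) * dt               # <x, A^* w>
```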


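Example 4 Let $X = L_2[0,1]$, $Y = \mathbb{R}^n$, and define the operator $A \in B(X,Y)$ as
Equation:

$$y = A(x) = \begin{bmatrix} x(t_1) \\ x(t_2) \\ \vdots \\ x(t_n) \end{bmatrix}.$$

What is its adjoint $A^*$? Once again, from the definition of adjoint,
Equation:

$$\langle Ax, w\rangle_Y = \langle x, A^*w\rangle_X,$$

where the subscript denotes the corresponding space for the sake of clarity. Then,
Equation:

$$\langle Ax, w\rangle_Y = [x(t_1)\ x(t_2)\ \cdots\ x(t_n)]\,w = \sum_{i=1}^n x(t_i)\,w_i,$$
$$\langle x, A^*w\rangle_X = \int_0^1 x(t)\,(A^*w)(t)\,dt = \int_0^1 x(t)\,v(t)\,dt,$$

where $v(t) = (A^*w)(t)$. We can successfully match these two inner products by using delta functions to define
$v$. Recall the properties of delta functions: (i) $x(t)\,\delta(t - t_0) = x(t_0)\,\delta(t - t_0)$, (ii) $\int_{-\infty}^{\infty}\delta(t - t_0)\,dt = 1$.
Therefore, we may set $v(t) = \sum_{i=1}^n w_i\,\delta(t - t_i)$, which then provides

Equation:

$$\langle x, A^*w\rangle_X = \int_0^1 x(t)\left(\sum_{i=1}^n w_i\,\delta(t - t_i)\right)dt = \sum_{i=1}^n w_i\int_0^1 x(t)\,\delta(t - t_i)\,dt$$
$$= \sum_{i=1}^n w_i\int_0^1 x(t_i)\,\delta(t - t_i)\,dt = \sum_{i=1}^n w_i\,x(t_i)\left(\int_0^1\delta(t - t_i)\,dt\right) = \sum_{i=1}^n w_i\,x(t_i).$$

This confirms that $v(t) = (A^*w)(t) = \sum_{i=1}^n w_i\,\delta(t - t_i)$.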

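Theorem 2 Let $X$, $Y$ be normed spaces and $A \in B(X,Y)$, then $(\mathcal{R}(A))^\perp = \mathcal{N}(A^*)$.

We will show the double inclusion $(\mathcal{R}(A))^\perp \subseteq \mathcal{N}(A^*)$ and $\mathcal{N}(A^*) \subseteq (\mathcal{R}(A))^\perp$. Let $y \in \mathcal{N}(A^*)$, i.e.,
$A^*y = 0$. Then for all $x \in X$, we have
Equation:

$$\langle x, A^*y\rangle = 0 = \langle Ax, y\rangle,$$

and so $y \in (\mathcal{R}(A))^\perp$, which implies that $\mathcal{N}(A^*) \subseteq (\mathcal{R}(A))^\perp$. Now, pick $y \in (\mathcal{R}(A))^\perp$, i.e., for all $x \in X$
we have that $\langle Ax, y\rangle = 0$. Then, $\langle x, A^*y\rangle = 0$, and since this is true for all $x \in X$, then $A^*y = 0$ and
$y \in \mathcal{N}(A^*)$. Therefore, $(\mathcal{R}(A))^\perp \subseteq \mathcal{N}(A^*)$. The two inclusions imply that $(\mathcal{R}(A))^\perp = \mathcal{N}(A^*)$. The
following statements can be proved in a similar fashion.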


Theorem 3 Let X, Y be normed spaces and A € B(X,Y), then 


Example 5 Let $A = [\,\ldots\,]$; then $\mathcal{R}(A) = \mathrm{span}(\ldots)$ and $\mathcal{N}(A^*) = \mathrm{span}(\ldots)$. It is easy to check that $(\mathcal{R}(A))^\perp = \mathcal{N}(A^*)$.

Matrix Representations of Linear Operators 
Introduces matrix representations of linear operators, with examples. 


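Linear operators involving finite-dimensional spaces can be represented in terms of matrices. Assume that $X$ and
$Y$ are finite-dimensional spaces and $A \in B(X,Y)$. Let $\{e_i\}$ be an orthonormal basis for $X$ so that for all $x \in X$
we have $x = \sum_i \langle x, e_i\rangle e_i$, giving $x$ the unique set of coefficients $a_i = \langle x, e_i\rangle$. Similarly, let $\{\tilde{e}_j\}$ be an
orthonormal basis for $Y$ so that for $y \in Y$ we have $y = \sum_j \langle y, \tilde{e}_j\rangle\tilde{e}_j$, giving $y$ the unique set of coefficients
$b_j = \langle y, \tilde{e}_j\rangle$. We will now show that the map $x \mapsto y = A(x)$ can be represented in terms of their coefficient
vectors as $a \mapsto b = \mathbf{A}a$, where $\mathbf{A}$ is a matrix.


Recall that $\psi_i = Ae_i \in Y$, so it can be written as $\psi_i = \sum_j \langle\psi_i, \tilde{e}_j\rangle\tilde{e}_j$. Therefore,
Equation:

$$y = Ax = A\left(\sum_i \langle x, e_i\rangle e_i\right) = \sum_i \langle x, e_i\rangle Ae_i = \sum_i \langle x, e_i\rangle\psi_i = \sum_i \langle x, e_i\rangle\sum_j \langle\psi_i, \tilde{e}_j\rangle\tilde{e}_j$$
$$= \sum_j\left(\sum_i \langle x, e_i\rangle\langle\psi_i, \tilde{e}_j\rangle\right)\tilde{e}_j.$$

Due to the uniqueness of coefficients for $y$ in $\{\tilde{e}_j\}$, we have that for each $j$,
Equation:

$$\sum_i \underbrace{\langle x, e_i\rangle}_{a_i}\,\underbrace{\langle\psi_i, \tilde{e}_j\rangle}_{A_{ji}} = \underbrace{\langle y, \tilde{e}_j\rangle}_{b_j}.$$

So we have found a matrix $\mathbf{A}$ with entries $A_{ji} = \langle\psi_i, \tilde{e}_j\rangle$ that provides $\mathbf{A}a = b$. Note that the matrix will be of
size $\dim(Y) \times \dim(X)$.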
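The construction can be sketched in a small finite-dimensional setting. Below, a hypothetical operator from $\mathbb{R}^3$ to $\mathbb{R}^2$ is given in standard coordinates by the matrix `T`, and the orthonormal bases (columns of `E` and `Et`) are generated by QR factorizations of random matrices; these names and choices are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 3, 2
T = rng.standard_normal((m, n))      # operator A: R^3 -> R^2 in standard coordinates

# Orthonormal bases {e_i} for X and {e~_j} for Y, stored as matrix columns.
E, _ = np.linalg.qr(rng.standard_normal((n, n)))
Et, _ = np.linalg.qr(rng.standard_normal((m, m)))

# Matrix entries A_ji = <psi_i, e~_j>, where psi_i = A e_i.
Psi = T @ E                          # column i holds psi_i
A_mat = np.array([[np.dot(Psi[:, i], Et[:, j]) for i in range(n)]
                  for j in range(m)])

x = rng.standard_normal(n)
a = E.T @ x                          # a_i = <x, e_i>
b = Et.T @ (T @ x)                   # b_j = <y, e~_j> with y = A x
```

The coefficient vectors satisfy `A_mat @ a == b` up to rounding, and `A_mat` has shape `(dim(Y), dim(X))`.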


Example 1 Consider the space $X \subset L_2[0,1]$ defined by $X = \mathrm{span}(\{x_1, x_2, x_3, x_4\})$, given below:

Functions in Example 1.


and the space $Y \subset L_2[0,4]$ given by $Y = \mathrm{span}(\{y_1, y_2\})$, where $y_1(t) = \sqrt{2}\cos(2\pi t)$ and
$y_2(t) = \sqrt{2}\cos(4\pi t)$. We define an operator $A : X \to Y$ as
Equation: 


y(t) = A(z(t)) = ( [ ‘2(0at) cos (Qnt) + ( i: ‘2 (0dt) cos (4rt). 


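It is easy to see that an orthonormal basis for $X$ is given by the functions $e_i(t) = x_i(t)$. One can also show that
an orthonormal basis for $Y$ is given by the functions $\tilde{e}_1(t) = \frac{1}{\sqrt{2}}\cos(2\pi t)$ and $\tilde{e}_2(t) = \frac{1}{\sqrt{2}}\cos(4\pi t)$. For this
choice of orthonormal bases for $X$ and $Y$, the transformed basis elements from $X$ are given by

Equation:

$$\psi_1(t) = A(x_1(t)) = \cos(2\pi t) = \sqrt{2}\,\tilde{e}_1,$$
$$\psi_2(t) = A(x_2(t)) = \cos(2\pi t) + \cos(4\pi t) = \sqrt{2}\,(\tilde{e}_1 + \tilde{e}_2),$$
$$\psi_3(t) = A(x_3(t)) = \cos(2\pi t) + \cos(4\pi t) = \sqrt{2}\,(\tilde{e}_1 + \tilde{e}_2),$$
$$\psi_4(t) = A(x_4(t)) = \cos(4\pi t) = \sqrt{2}\,\tilde{e}_2.$$

It is then easy to check that the entries of the matrix are given by

$$A_{11} = \langle\psi_1, \tilde{e}_1\rangle = \sqrt{2}, \quad A_{12} = \langle\psi_2, \tilde{e}_1\rangle = \sqrt{2}, \quad A_{13} = \langle\psi_3, \tilde{e}_1\rangle = \sqrt{2}, \quad A_{14} = \langle\psi_4, \tilde{e}_1\rangle = 0,$$
$$A_{21} = \langle\psi_1, \tilde{e}_2\rangle = 0, \quad A_{22} = \langle\psi_2, \tilde{e}_2\rangle = \sqrt{2}, \quad A_{23} = \langle\psi_3, \tilde{e}_2\rangle = \sqrt{2}, \quad A_{24} = \langle\psi_4, \tilde{e}_2\rangle = \sqrt{2}.$$

Thus, the matrix representation for the operator $A$ using these orthonormal bases is given by
Equation:

$$\mathbf{A} = \begin{bmatrix} \sqrt{2} & \sqrt{2} & \sqrt{2} & 0 \\ 0 & \sqrt{2} & \sqrt{2} & \sqrt{2} \end{bmatrix}.$$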


Eigendecomposition of Linear Operators 
Introduces the concept of the eigendecomposition of a linear operator, with properties and 
examples. 


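Definition 1 A scalar $\lambda$ is an eigenvalue of $A \in B(X,X)$ if there exists a nonzero vector $e$ (dubbed the
eigenvector for $\lambda$) such that $Ae = \lambda e$.


Nonzero multiples of eigenvectors are also eigenvectors, as shown below:
Equation:

$$A(\alpha e) = \alpha Ae = \alpha\lambda e = \lambda(\alpha e).$$

Definition 2 The eigenspace of $A$ corresponding to $\lambda$ is defined by $\mathcal{E}_\lambda = \{e \in X : Ae = \lambda e\}$.


For example, if a given eigenvalue has two linearly independent eigenvectors, the eigenspace is given by
$[\{e_1, e_2\}]$ where $Ae_1 = \lambda e_1$ and $Ae_2 = \lambda e_2$.


Definition 3 An operator $A \in B(X,X)$ is said to be self-adjoint if $A^* = A$, i.e.,
$\langle Ax, y\rangle = \langle x, Ay\rangle$ for all $x, y \in X$.


If $X = \mathbb{R}^N$, a self-adjoint operator corresponds to a symmetric matrix.

Theorem 1 All eigenvalues of a self-adjoint operator are real.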
Theorem 1 All eigenvalues of a self-adjoint operator are real. 


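Let $\lambda$ be a complex eigenvalue of a self-adjoint operator $A$. Then $Ae = \lambda e$ for some $e \ne 0$. For
such an $e$, we have
Equation:

$$\lambda\|e\|^2 = \lambda\langle e, e\rangle = \langle\lambda e, e\rangle = \langle Ae, e\rangle = \langle e, A^*e\rangle = \langle e, Ae\rangle = \langle e, \lambda e\rangle = \bar{\lambda}\langle e, e\rangle = \bar{\lambda}\|e\|^2.$$

Therefore we know that $\lambda$ is the same as its complex conjugate ($\lambda = \bar{\lambda}$). The only way
for this to be possible is if the imaginary part of $\lambda$ is zero, and therefore $\lambda \in \mathbb{R}$.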


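Theorem 2 If $\lambda_1$, $\lambda_2$ are distinct eigenvalues of a self-adjoint operator $A \in B(X,X)$, then
$\mathcal{E}_{\lambda_1} \perp \mathcal{E}_{\lambda_2}$.


Assume we pick some arbitrary $e_1 \in \mathcal{E}_{\lambda_1}$ and $e_2 \in \mathcal{E}_{\lambda_2}$. Then, since the eigenvalues are real,
Equation:

$$\lambda_1\langle e_1, e_2\rangle = \langle\lambda_1e_1, e_2\rangle = \langle Ae_1, e_2\rangle = \langle e_1, Ae_2\rangle = \langle e_1, \lambda_2e_2\rangle = \lambda_2\langle e_1, e_2\rangle.$$

Therefore we know that $\lambda_1\langle e_1, e_2\rangle = \lambda_2\langle e_1, e_2\rangle$. Since we chose two distinct eigenvalues, we
know that $\lambda_1 \ne \lambda_2$. Therefore we must have $\langle e_1, e_2\rangle = 0$. This implies $e_1 \perp e_2$. Since $e_1$ and
$e_2$ were chosen arbitrarily, this implies $\mathcal{E}_{\lambda_1} \perp \mathcal{E}_{\lambda_2}$.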
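Both theorems are easy to observe numerically. The sketch below builds a hypothetical random Hermitian matrix (a self-adjoint operator on $\mathbb{C}^4$) and deliberately uses the general-purpose solver `np.linalg.eig`, which does not assume self-adjointness, so that the real eigenvalues and orthogonal eigenvectors are a consequence of the matrix structure rather than of the solver.

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = (M + M.conj().T) / 2                 # Hermitian: A^* = A

eigvals, eigvecs = np.linalg.eig(A)
max_imag = np.max(np.abs(eigvals.imag))  # numerically zero (Theorem 1)

# Eigenvectors for distinct eigenvalues are orthogonal (Theorem 2):
# the Gram matrix of the unit-norm eigenvectors is close to the identity.
V = eigvecs / np.linalg.norm(eigvecs, axis=0)
gram = V.conj().T @ V
```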


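Theorem 3 (Schur's Lemma) Let $X$ be an $N$-dimensional Hilbert space and $A \in B(X,X)$.
Then there exists an orthonormal basis $\{\varphi_1, \varphi_2, \ldots, \varphi_N\}$ for $X$ and a set of coefficients
$\{\lambda_{ij}\}_{i \le j}$ such that $A\varphi_j = \sum_{i=1}^j \lambda_{ij}\varphi_i$, $j = 1, 2, \ldots, N$.


An important note: since $\{\varphi_i\}$ is an orthonormal basis for both the domain $X$ and the range $X$,
there always exist coefficients $a_{ij}$ such that $A\varphi_j = \sum_{i=1}^N a_{ij}\varphi_i$ for $j = 1, 2, \ldots, N$;
Schur's Lemma states that the basis can be chosen so that only the first $j$ terms are needed.


Example 1 Recall the matrix representation of an operator: given $A \in B(X,X)$ and an
orthonormal basis $\{\varphi_i\}_{i=1}^N$ for $X$, we can write the matrix representation $\mathbf{A}$ of the operator $A$
with entries
Equation:

$$A_{ij} = \langle A\varphi_j, \varphi_i\rangle = \left\langle\sum_{k=1}^j \lambda_{kj}\varphi_k, \varphi_i\right\rangle = \sum_{k=1}^j \lambda_{kj}\langle\varphi_k, \varphi_i\rangle = \begin{cases} \lambda_{ij}, & \text{if } i \le j, \\ 0, & \text{if } i > j. \end{cases}$$

We can represent $\mathbf{A}$ as an upper-triangular matrix (where $\times$ represents a possibly non-zero entry in the
matrix):

Equation:

$$\mathbf{A} = \begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{bmatrix}.$$

This shows that there exists an orthonormal basis $\{\varphi_i\}$ for which the matrix representation of $A$ is
an upper-triangular matrix.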


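Theorem 4 If $X$ is an $N$-dimensional space and $A \in B(X,X)$ is self-adjoint, then there exists an
orthonormal basis $\{\varphi_i\}_{i=1}^N$ such that $A\varphi_i = \lambda_i\varphi_i$ for a set of scalars $\lambda_1, \lambda_2, \ldots, \lambda_N$. The matrix
representation $\mathbf{A}$ is a diagonal matrix whose entries are the eigenvalues; that is,

$$\mathbf{A} = \mathrm{diag}(\{\lambda_i\}).$$

Note that, according to the theorem, we can fully represent the operator $A$ by the
aforementioned orthonormal basis $\{\varphi_i\}$ and the diagonal matrix $\mathbf{A}$.


Pick the orthonormal basis $\{\varphi_1, \varphi_2, \ldots, \varphi_N\}$ specified by Schur's Lemma. For $i < j$, we have:

Equation:

$$A_{ij} = \langle A\varphi_j, \varphi_i\rangle = \left\langle\sum_{k=1}^j \lambda_{kj}\varphi_k, \varphi_i\right\rangle = \langle\varphi_j, A\varphi_i\rangle = \left\langle\varphi_j, \sum_{k=1}^i \lambda_{ki}\varphi_k\right\rangle = \sum_{k=1}^i \bar{\lambda}_{ki}\langle\varphi_j, \varphi_k\rangle,$$

where the third equality uses the fact that $A$ is self-adjoint. Since $k \le i < j$, every term in the
last sum is equal to zero, so $A_{ij} = \lambda_{ij} = 0$ for $i < j$. We then have:
Equation:

$$A\varphi_j = \sum_{k=1}^j \lambda_{kj}\varphi_k = \lambda_{jj}\varphi_j.$$

Thus, the only non-zero entries of the representation matrix $\mathbf{A}$ are the diagonal entries.
Furthermore, these entries are eigenvalues of $A$.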


Recall that to compute $Ax$ using its matrix representation $\mathbf{A}$, there are three steps:


1. Represent $x$ using the orthonormal basis $\{\varphi_j\}$ and collect the coefficients into a vector $c$.

2. Perform the matrix-vector product $d = \mathbf{A}c$.

3. Then obtain $Ax = \sum_{j=1}^N d_j\varphi_j$.


A matrix-vector product with a dense matrix is computationally intensive. If we can diagonalize
$A$, the matrix-vector product in the second step reduces to $d = \mathbf{A}c$ with $\mathbf{A}$ a
diagonal matrix, making the operation more efficient.
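The three steps can be sketched for a symmetric matrix, where `np.linalg.eigh` provides the orthonormal eigenvector basis. The random symmetric matrix below is a hypothetical example; in the diagonalizing basis, the second step is an elementwise product rather than a dense matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 5
M = rng.standard_normal((N, N))
A = (M + M.T) / 2                    # symmetric (self-adjoint) operator on R^N

lam, Phi = np.linalg.eigh(A)         # columns of Phi: orthonormal eigenvectors

x = rng.standard_normal(N)
c = Phi.T @ x                        # step 1: coefficients c_j = <x, phi_j>
d = lam * c                          # step 2: diagonal product, N multiplications
Ax = Phi @ d                         # step 3: Ax = sum_j d_j phi_j
```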


The Karhunen-Loeve Transform
Introduces the Karhunen-Loeve transform, with applications.


Define the random vector
Equation:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$

with mean zero and covariance matrix $R_X = E[xx^*]$; this matrix is symmetric and positive
semidefinite.


Lemma 1 Every eigenvalue of Rx is real and non-negative. 


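Let $e$ be an eigenvector of $R_X$ with eigenvalue $\lambda$.
Equation:

$$\lambda\|e\|^2 = \lambda\langle e, e\rangle = \langle\lambda e, e\rangle = \langle R_Xe, e\rangle \ge 0.$$

The last statement follows from the definition of positive semidefinite. We have $\lambda\|e\|^2 \ge 0$.
Since $\|e\|^2 > 0$, it follows that $\lambda \ge 0$, i.e., all the eigenvalues are non-negative.

The eigenvectors of the matrix $R_X$ provide an orthonormal basis $\{\varphi_1, \varphi_2, \ldots, \varphi_N\}$, which can be
collected into an orthonormal basis matrix $\Phi = [\varphi_1\ \varphi_2\ \cdots\ \varphi_N]$. Then let $y = \Phi^*x$. We have:
Equation:

$$R_Y = E[yy^*] = E[\Phi^*xx^*\Phi] = \Phi^*E[xx^*]\Phi = \Phi^*R_X\Phi.$$

Let us look at the adjoint of $\Phi^*R_X$:
Equation:

$$(\Phi^*R_X)^* = R_X\Phi = R_X[\varphi_1\ \varphi_2\ \cdots\ \varphi_N] = [\lambda_1\varphi_1\ \lambda_2\varphi_2\ \cdots\ \lambda_N\varphi_N].$$

If we take the adjoint again, we get
Equation:

$$\Phi^*R_X = \begin{bmatrix} \lambda_1\varphi_1^* \\ \vdots \\ \lambda_N\varphi_N^* \end{bmatrix}.$$

Going back to our derivation of $R_Y$:
Equation:

$$R_Y = \Phi^*R_X\Phi = \begin{bmatrix} \lambda_1\varphi_1^* \\ \vdots \\ \lambda_N\varphi_N^* \end{bmatrix}[\varphi_1\ \cdots\ \varphi_N] = \begin{bmatrix} \lambda_1\langle\varphi_1, \varphi_1\rangle & \lambda_1\langle\varphi_1, \varphi_2\rangle & \cdots & \lambda_1\langle\varphi_1, \varphi_N\rangle \\ \lambda_2\langle\varphi_2, \varphi_1\rangle & \lambda_2\langle\varphi_2, \varphi_2\rangle & \cdots & \lambda_2\langle\varphi_2, \varphi_N\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_N\langle\varphi_N, \varphi_1\rangle & \lambda_N\langle\varphi_N, \varphi_2\rangle & \cdots & \lambda_N\langle\varphi_N, \varphi_N\rangle \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_N \end{bmatrix}.$$

The matrix $\Phi$ is known as the KLT matrix defined by $R_X$. The transformation given by the
KLT matrix provides a set of random variables $y_i = \langle\varphi_i, x\rangle$ that are uncorrelated.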
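The decorrelation property can be illustrated with synthetic data. In this sketch the random vector is generated as $x = Cg$ with $g$ white, so that $R_X = CC^T$ is known exactly; the mixing matrix `C` and the sample size are assumptions made for the example. The exact transformed covariance is diagonal, and the sample covariance of the KLT coefficients is approximately diagonal.

```python
import numpy as np

rng = np.random.default_rng(7)
N, num_samples = 4, 100_000

# Zero-mean samples with correlated entries: x = C g, so R_X = C C^T.
C = rng.standard_normal((N, N))
X = C @ rng.standard_normal((N, num_samples))    # one sample per column
R_X = C @ C.T

lam, Phi = np.linalg.eigh(R_X)                   # R_X = Phi diag(lam) Phi^T
Y = Phi.T @ X                                    # KLT coefficients y = Phi^* x

R_Y_exact = Phi.T @ R_X @ Phi                    # equals diag(lam)
R_Y_sample = (Y @ Y.T) / num_samples             # approximately diag(lam)
```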


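Example 1 (Whitening Filter) For a random vector $x$, assume that $R_X$ has positive eigenvalues. Let us
write $R_Y^{-1/2} = \mathrm{diag}(\lambda_1^{-1/2}, \ldots, \lambda_N^{-1/2})$ and $z = R_Y^{-1/2}y$, where $y = \Phi^*x$. We have

Equation:

$$R_Z = E[zz^*] = E[R_Y^{-1/2}yy^*R_Y^{-1/2}] = R_Y^{-1/2}E[yy^*]R_Y^{-1/2} = R_Y^{-1/2}R_YR_Y^{-1/2} = I.$$

The matrix $R_Y^{-1/2}\Phi^*$ is known as a "whitening filter", as it maps an arbitrary random vector $x$
to a "white" vector $z$ with uncorrelated, unit-variance entries (a "white Gaussian noise" vector when $x$ is Gaussian).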
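A minimal numerical sketch of the whitening filter follows. The covariance `R_X` is a hypothetical positive definite matrix; the filter `W` is built from its eigendecomposition, and the covariance of $z = Wx$ is computed in closed form as $W R_X W^*$.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 3
C = rng.standard_normal((N, N))
R_X = C @ C.T + 0.1 * np.eye(N)          # positive definite covariance (assumed)

lam, Phi = np.linalg.eigh(R_X)
W = np.diag(lam ** -0.5) @ Phi.T         # whitening filter R_Y^{-1/2} Phi^*

R_Z = W @ R_X @ W.T                      # covariance of z = W x
```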


Example 2 (Transform Coding) Let $U : \mathbb{C}^n \to \mathbb{C}^n$ be a unitary operator. Assume we have a
signal $x \in \mathbb{C}^n$ that we want to send through a channel by only sending $k$ numbers or "items",
where $k < n$; in words, we wish to compress the signal $x$. The block diagram for the
compression/transmission system is given in [link].


$x \to U^* \to y \to$ Data compression $\to$ Comm channel $\to$ Decompression $\to \hat{y} \to U \to \hat{x}$

Block diagram for a transform coding system.


We want to minimize $E[\|x - \hat{x}\|^2]$ given $k$ by choosing the optimal transformation $U$. We
know $y = U^*x$, which implies $x = Uy$ since $U$ is unitary. Therefore,
Equation:

$$\|x - \hat{x}\| = \|Uy - U\hat{y}\| = \|U(y - \hat{y})\| = \|y - \hat{y}\|.$$

This means that we can minimize $\|y - \hat{y}\|$ in place of $\|x - \hat{x}\|$. For simplicity, we choose a
basic means of compression that preserves only the first $k$ entries of $y$:
Equation:

$$\hat{y}_i = \begin{cases} y_i, & \text{if } i = 1, 2, \ldots, k, \\ 0, & \text{if } i = k+1, \ldots, n. \end{cases}$$

We then have $E[\|x - \hat{x}\|^2] = E[\|y - \hat{y}\|^2] = E\left[\sum_{i=k+1}^n |y_i|^2\right] = \sum_{i=k+1}^n E[|y_i|^2]$.
Therefore,
Equation:

$$\min_U E[\|x - \hat{x}\|^2] = \min_U \sum_{i=k+1}^n E[|y_i|^2] = \min_U \sum_{i=k+1}^n E[|\langle x, u_i\rangle|^2]$$
$$= \min_U \sum_{i=k+1}^n E[u_i^Txx^Tu_i] = \min_U \sum_{i=k+1}^n u_i^TE[xx^T]u_i = \min_U \sum_{i=k+1}^n u_i^TR_Xu_i,$$

where $u_i$ denotes the $i$-th column of $U$. It turns out that the choice of transform basis $U$ that minimizes this amount is provided by the
eigendecomposition of $R_X$, as specified by the following theorem.


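Theorem 1 Let $x$ be a length-$n$ random vector with covariance matrix $R_X = E[xx^T]$ that has
eigenvalues $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \ldots \ge \lambda_n \ge 0$ and matching eigenvectors $\varphi_1, \varphi_2, \ldots, \varphi_n$. Let $x_M$ be
the orthogonal projection of $x$ onto a subspace $M$ of dimension $k$. Then

Equation:

$$E[\|x - x_M\|^2] \ge \sum_{j=k+1}^n \lambda_j,$$

with equality if $M = \mathrm{span}(\{\varphi_1, \varphi_2, \ldots, \varphi_k\})$.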


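From equation [link], we have

Equation:

$$\min_U E[\|x - x_M\|^2] = \min_U \sum_{i=k+1}^n u_i^TR_Xu_i = \min_U \sum_{i=k+1}^n u_i^T\Phi\Lambda\Phi^Tu_i,$$

where $R_X = \Phi\Lambda\Phi^T$ is the eigendecomposition of $R_X$. Now, since
Equation:

$$\Phi^Tu_i = \begin{bmatrix} \langle u_i, \varphi_1\rangle \\ \langle u_i, \varphi_2\rangle \\ \vdots \\ \langle u_i, \varphi_n\rangle \end{bmatrix} \quad \text{and} \quad \Lambda\Phi^Tu_i = \begin{bmatrix} \langle u_i, \varphi_1\rangle\lambda_1 \\ \langle u_i, \varphi_2\rangle\lambda_2 \\ \vdots \\ \langle u_i, \varphi_n\rangle\lambda_n \end{bmatrix},$$

we have that $u_i^T\Phi\Lambda\Phi^Tu_i = (\Phi^Tu_i)^T\Lambda\Phi^Tu_i = \sum_{j=1}^n |\langle u_i, \varphi_j\rangle|^2\lambda_j$. Plugging this into [link],
we have

Equation:

$$\min_U E[\|x - x_M\|^2] = \min_U \sum_{i=k+1}^n\sum_{j=1}^n |\langle u_i, \varphi_j\rangle|^2\lambda_j = \min_U \sum_{j=1}^n \lambda_j\sum_{i=k+1}^n |\langle u_i, \varphi_j\rangle|^2.$$

Now, denote $\alpha_j = \sum_{i=k+1}^n |\langle u_i, \varphi_j\rangle|^2$ and see that
Equation:

$$\sum_{j=1}^n \alpha_j = \sum_{j=1}^n\sum_{i=k+1}^n |\langle u_i, \varphi_j\rangle|^2 = \sum_{i=k+1}^n\sum_{j=1}^n |\langle u_i, \varphi_j\rangle|^2 = \sum_{i=k+1}^n \|u_i\|^2 = n - k,$$

as all $u_i$ are unit-norm. Note also that $0 \le \alpha_j \le 1$, since the $u_i$ are orthonormal and
$\|\varphi_j\| = 1$. Now, we have that
Equation:

$$\min_U E[\|x - x_M\|^2] = \min_U \sum_{j=1}^n \lambda_j\alpha_j = \min_U\left(\sum_{j=1}^k \lambda_j\alpha_j - \sum_{j=k+1}^n \lambda_j(1 - \alpha_j) + \sum_{j=k+1}^n \lambda_j\right).$$

Since the $\lambda_j$ are monotonically decreasing, we have that
Equation:

$$\min_U E[\|x - x_M\|^2] \ge \min_U\left(\sum_{j=1}^k \lambda_k\alpha_j - \sum_{j=k+1}^n \lambda_k(1 - \alpha_j) + \sum_{j=k+1}^n \lambda_j\right)$$
$$= \min_U\left(\lambda_k\left[\sum_{j=1}^k \alpha_j - \sum_{j=k+1}^n 1 + \sum_{j=k+1}^n \alpha_j\right] + \sum_{j=k+1}^n \lambda_j\right)$$
$$= \min_U\left(\lambda_k\left[\sum_{j=1}^n \alpha_j - (n - k)\right] + \sum_{j=k+1}^n \lambda_j\right)$$
$$= \min_U\left(\lambda_k\left[(n - k) - (n - k)\right] + \sum_{j=k+1}^n \lambda_j\right)$$
$$= \sum_{j=k+1}^n \lambda_j.$$

If we set $M = \mathrm{span}(\{\varphi_1, \varphi_2, \ldots, \varphi_k\})$ (i.e., $U = \Phi$), then it is easy to check that
Equation:

$$E[\|x - x_M\|^2] = \sum_{j=k+1}^n \lambda_j,$$

proving the theorem.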
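The theorem can be probed numerically by comparing the error of the eigenvector basis against other orthonormal bases. In this sketch the covariance `R_X` is a hypothetical random positive semidefinite matrix, and the helper `expected_error` (a name introduced here) evaluates $\sum_{i>k} u_i^TR_Xu_i$ for a given basis; the random comparison bases are generated by QR factorization.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 6, 2
C = rng.standard_normal((n, n))
R_X = C @ C.T

lam, Phi = np.linalg.eigh(R_X)
order = np.argsort(lam)[::-1]            # sort eigenvalues in decreasing order
lam, Phi = lam[order], Phi[:, order]

def expected_error(U, R, k):
    """Sum of u_i^T R u_i over the discarded columns i > k of U."""
    tail = U[:, k:]
    return np.trace(tail.T @ R @ tail)

err_klt = expected_error(Phi, R_X, k)    # equals the sum of the n - k smallest eigenvalues

# Errors for a few random orthonormal bases; none should beat the KLT basis.
errs_random = [
    expected_error(np.linalg.qr(rng.standard_normal((n, n)))[0], R_X, k)
    for _ in range(100)
]
```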


Example 3 (Transform Coding) Transform coding is a common scheme for data compression
that leverages the Karhunen-Loève transform. Examples include JPEG and MP3. In particular,
JPEG can be broadly described as follows:


1. Take the image $x$ and create tiles of size $8 \times 8$. We assume that the tiles are draws from a
random variable $X$, i.e., the tiles $x_1, x_2, \ldots \in \mathbb{R}^{64}$, with $R_X$ estimated by the sample average of $x_ix_i^T$ over the tiles.

2. Compute the KLT of the tile random variable $X$ from $R_X$ by obtaining its
eigendecomposition $R_X = \Phi\Lambda\Phi^T$.

3. Compute KLT coefficients for each block as $c_i = \Phi^Tx_i$.

4. Pick as many coefficients of $c_i$ as allowed by communications or storage constraints; save
them as the compressed image.

5. Load saved coefficients and append zeros to build coefficient vector $\hat{c}_i$.

6. Run inverse KLT to obtain the decompressed tiles $\hat{x}_i = \Phi\hat{c}_i$.

7. Reassemble the image from the decompressed tiles.


In practice, it is not desirable to recompute the KLT for each individual image. Thus, the JPEG 
algorithm employs the discrete cosine transform (DCT). It turns out that the DCT is a good 


approximation of the KLT for tiles of natural images. Additionally, instead of selecting a subset 
of the coefficients, they are quantized to varying quality/error according to their index and the 
total amount of bits available. 
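The seven KLT-based steps listed above can be sketched as follows. This is not the JPEG algorithm itself (which uses the DCT and quantization, as noted); the test image, the tile bookkeeping via `reshape`, and the number of retained coefficients `k` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
# Hypothetical smooth 128 x 128 test image (double cumulative sum of white noise).
img = np.cumsum(np.cumsum(rng.standard_normal((128, 128)), axis=0), axis=1)

# Step 1: cut into 8 x 8 tiles; each tile becomes one column of a 64 x 256 matrix.
tiles = img.reshape(16, 8, 16, 8).swapaxes(1, 2).reshape(256, 64).T
R_X = tiles @ tiles.T / tiles.shape[1]   # sample estimate of R_X

# Step 2: eigendecomposition R_X = Phi diag(lam) Phi^T, eigenvalues decreasing.
lam, Phi = np.linalg.eigh(R_X)
lam, Phi = lam[::-1], Phi[:, ::-1]

# Steps 3-5: KLT coefficients; keep the first k per tile and zero out the rest.
k = 16
c = Phi.T @ tiles
c_hat = c.copy()
c_hat[k:, :] = 0.0

# Steps 6-7: inverse KLT and reassembly of the image from the tiles.
tiles_hat = Phi @ c_hat
img_hat = tiles_hat.T.reshape(16, 16, 8, 8).swapaxes(1, 2).reshape(128, 128)

sq_error = np.sum((img - img_hat) ** 2)  # total squared reconstruction error
```

Because `R_X` is estimated from the same tiles, the total squared error equals the number of tiles times the sum of the discarded eigenvalues.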


Singular Value Decomposition 
Introduces the singular value decomposition and its application in principal 
component analysis. 


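Singular value decomposition (SVD) can be thought of as an extension to
eigenvalue decomposition for non-symmetric (and non-square) matrices. Consider an $m \times n$
matrix $X$. The following two matrices are Hermitian and so have
eigenvalue decompositions

Equation:

$$XX^H = U\Lambda U^H \quad \text{and} \quad X^HX = V\tilde{\Lambda}V^H,$$

where $XX^H$ is an $m \times m$ matrix and $X^HX$ is an $n \times n$ matrix (their nonzero
eigenvalues coincide). It turns out that we can therefore decompose the matrix $X$ as
$X = U\Sigma V^H$, where $\Sigma$ is an $m \times n$ "diagonal" matrix: $\Sigma_{i,i} = \sigma_i$ are the
singular values of $X$, and $\Sigma_{i,j} = 0$ for $i \ne j$. The pseudoinverse of the matrix can then be written
as $X^\dagger = V\Sigma^\dagger U^H$, where $\Sigma^\dagger$ is $n \times m$ with $\Sigma^\dagger_{i,i} = 1/\sigma_i$ for nonzero $\sigma_i$ and $\Sigma^\dagger_{i,j} = 0$ for $i \ne j$.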
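These relations can be checked with NumPy's SVD routine. The sketch below uses a hypothetical real $5 \times 3$ matrix of full column rank, rebuilds $\Sigma$ and $\Sigma^\dagger$ explicitly, and compares the result with `np.linalg.pinv`; it also confirms that the squared singular values are the nonzero eigenvalues of $XX^H$.

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.standard_normal((5, 3))

U, s, Vh = np.linalg.svd(X)              # X = U Sigma V^H; s holds the singular values
Sigma = np.zeros(X.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Pseudoinverse: invert the nonzero singular values; Sigma^+ has the transposed shape.
Sigma_pinv = np.zeros(X.shape[::-1])
Sigma_pinv[:len(s), :len(s)] = np.diag(1.0 / s)   # assumes all s > 0 (full rank)
X_pinv = Vh.conj().T @ Sigma_pinv @ U.conj().T

eig_XXH = np.linalg.eigvalsh(X @ X.T)[::-1]       # eigenvalues of X X^H, decreasing
```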


Principal Component Analysis 


Principal component analysis can be thought of as KLT for sampled data.
Assume that $\{x_1, x_2, \ldots, x_L\} \subset \mathbb{R}^n$ is a zero-mean dataset, and collect it
into a matrix $X = [x_1\ x_2\ \cdots\ x_L] \in \mathbb{R}^{n \times L}$. Next, compute the SVD
$X = U\Sigma V^T$ with the corresponding eigenvalue decomposition
$XX^T = U\Lambda U^T$. The matrix $U$ is known as the principal component
analysis (PCA) matrix of $X$; its columns $u_1, u_2, \ldots, u_n$ are known as
principal components, and its PCA coefficients are given by
$Y = U^TX = \Sigma V^T$. The matrix $Y$ contains the "scores" of all data points
in the columns of $X$ against the principal components $u_i$. One can show
that the principal components in the matrix $U$ follow the formulation
Equation:

$$u_i = \arg\max_{u \in \mathbb{R}^n,\ \|u\|_2 = 1}\ \sum_{l=1}^L |\langle x_l, u\rangle|^2$$
$$\text{subject to } \langle u, u_j\rangle = 0, \quad j = 1, \ldots, i-1.$$

In words, $u_i$ is the direction in which the projections of the data have the
largest variance while being orthogonal to $\{u_1, u_2, \ldots, u_{i-1}\}$.
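A small numerical sketch of PCA via the SVD follows. The dataset is synthetic (anisotropic Gaussian samples, centered to enforce the zero-mean assumption), and the helper `projected_energy` is a name introduced here for $\sum_l |\langle x_l, u\rangle|^2$; random unit vectors serve as competitors for the first principal component.

```python
import numpy as np

rng = np.random.default_rng(12)
n, L = 3, 500
X = np.diag([3.0, 1.0, 0.2]) @ rng.standard_normal((n, L))
X = X - X.mean(axis=1, keepdims=True)    # enforce the zero-mean assumption

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = U.T @ X                              # PCA coefficients (scores), equal to Sigma V^T

def projected_energy(u):
    """Sum over data points of |<x_l, u>|^2 for a unit vector u."""
    return np.sum((u @ X) ** 2)

best = projected_energy(U[:, 0])         # energy along the first principal component

# Random unit vectors to compare against the first principal component.
trials = rng.standard_normal((1000, n))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
```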


Optimization in Hilbert Spaces 
Introduces general optimization theory in Hilbert Spaces 


In the remainder of the course we will discuss optimization problems. In general, an optimization problem
consists of picking the "best" signal according to some metric; the metric will be some functional
$f : X \to Y$ and the search will be over a set of interest $D \subseteq X$, so that the problem can be written as
Equation:

$$\hat{x} = \arg\max_{x \in D} f(x) \quad \text{or} \quad \hat{x} = \arg\min_{x \in D} f(x).$$

We will need to extend the ideas of derivatives and gradients (which are used in optimization of single-
variable real-valued functions) to arbitrary signal spaces where we can move in infinitely many directions on a set
of interest.


Optimization examples. We wish to find the largest or smallest value of a functional $f(x)$,
$x \in X$, over a set $D \subseteq X$. For scalar-valued functions of scalar fields, the
maximizer/minimizer is found by solving $\frac{df}{dx} = 0$.


Directional Derivatives 


Assume that we have a metric function $f : X \to Y$ and a set of interest $D \subseteq X$. Navigating the "surface"
of $f$ to find a maximum or minimum requires us to formulate a framework for derivatives.


Definition 1 Let $x \in D \subseteq X$ and $h \in X$ be arbitrary. If the limit
Equation:

$$\delta f(x; h) = \lim_{\alpha \to 0}\frac{1}{\alpha}[f(x + \alpha h) - f(x)]$$

exists, it is called the Gateaux differential of $f$ at $x$ with increment (or in the direction) $h$. If the limit exists
for each $h \in X$, the transformation $f$ is said to be Gateaux differentiable at $x$. If $f$ is Gateaux
differentiable at all $x \in X$, then it is called a Gateaux differentiable functional.


This extends the concept of derivative to incorporate direction so it can be used for any signal space. Note
that $\alpha$ needs to be sufficiently small so that $x + \alpha h \in D$. Note also that for a fixed point $x$ and variable
direction $h$, the Gateaux differential is a map from $X$ to $Y$, i.e., $\delta f(x; \cdot) : X \to Y$.


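Fact 1 In the common case of $Y = \mathbb{R}$,
Equation:

$$\delta f(x; h) = \frac{\partial}{\partial\alpha}f(x + \alpha h)\Big|_{\alpha = 0}.$$

Example 1 Let $H$ be a (real) Hilbert space and $L \in B(H,H)$. Define the function $f : H \to \mathbb{R}$ by
$f(x) = \langle Lx, x\rangle$. What is its Gateaux differential? From the definition,
Equation:

$$\delta f(x; h) = \frac{\partial}{\partial\alpha}\left(\langle L(x + \alpha h), x + \alpha h\rangle\right)\Big|_{\alpha = 0}.$$

We compute the derivative:

Equation:

$$\langle L(x + \alpha h), x + \alpha h\rangle = \langle Lx, x\rangle + \alpha\langle Lh, x\rangle + \alpha\langle Lx, h\rangle + \alpha^2\langle Lh, h\rangle,$$
$$\frac{\partial}{\partial\alpha}\left(\langle L(x + \alpha h), x + \alpha h\rangle\right) = \langle Lh, x\rangle + \langle Lx, h\rangle + 2\alpha\langle Lh, h\rangle.$$

Therefore,
Equation:

$$\delta f(x; h) = \langle Lh, x\rangle + \langle Lx, h\rangle = \langle Lh, x\rangle + \langle h, Lx\rangle = \langle h, L^*x\rangle + \langle h, Lx\rangle = \langle h, (L + L^*)x\rangle.$$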
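The closed form $\delta f(x;h) = \langle h, (L + L^*)x\rangle$ can be compared against a finite-difference approximation of the defining limit. The sketch below works in $\mathbb{R}^n$ with a hypothetical non-symmetric matrix `L`; the step size `alpha` is an arbitrary small value chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(13)
n = 4
L = rng.standard_normal((n, n))          # a (non-symmetric) operator on R^n

def f(v):
    """Quadratic functional f(x) = <Lx, x>."""
    return np.dot(L @ v, v)

x = rng.standard_normal(n)
h = rng.standard_normal(n)

alpha = 1e-6
# Central difference approximation of d/d(alpha) f(x + alpha h) at alpha = 0.
numeric = (f(x + alpha * h) - f(x - alpha * h)) / (2 * alpha)
# Closed form: <h, (L + L^*) x>, with L^* = L^T for a real matrix.
analytic = np.dot(h, (L + L.T) @ x)
```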


Unfortunately, the Gateaux differential does not satisfy our need to connect differentiability to continuity.


Definition 2 Let $f : X \to Y$ be a transformation on $D \subseteq X$. If for each $x \in D$ and each direction $h \in X$
there exists a function $\delta f(x; h) : D \times X \to Y$ that is linear and continuous with respect to $h$ such that
Equation:

$$\lim_{\|h\| \to 0}\frac{\|f(x + h) - f(x) - \delta f(x; h)\|}{\|h\|} = 0,$$

then $f$ is said to be Fréchet differentiable at $x$ and $\delta f(x; h)$ is said to be the Fréchet differential of $f$ at $x$
with increment $h$.


One can intuitively see that there is a stronger connection between the common definition of a derivative 
(for functions R — R) and the Fréchet derivative. There are additional connections between the 
derivatives and their properties. 


Lemma 1 If a function $f$ is Fréchet differentiable then $\delta f(x; h)$ is unique.


Lemma 2 If the Fréchet differential of f exists at x, then the Gateaux differential of f exists at x and they 
are equal. 


Lemma 3 If f defined on an open set D C X has a Fréchet differential at x then f is continuous at x. 


For a Fréchet-differentiable function, for any $\epsilon > 0$ we have, for all sufficiently small nonzero $h \in X$, that
Equation:

$$\frac{\|f(x + h) - f(x) - \delta f(x; h)\|}{\|h\|} < \epsilon.$$

This in turn implies

Equation:

$$\|f(x + h) - f(x)\| \le \|f(x + h) - f(x) - \delta f(x; h)\| + \|\delta f(x; h)\| < \epsilon\|h\| + \|\delta f(x; \cdot)\|\,\|h\| = (\epsilon + \|\delta f(x; \cdot)\|)\,\|h\|,$$

as $\delta f(x; h)$ is a linear continuous functional on $h$, implying that it is bounded. Therefore, as $\|h\| \to 0$,
we have
Equation:

$$\lim_{\|h\| \to 0}\|f(x + h) - f(x)\| = 0.$$

This implies that $f$ is continuous at $x$.


Local Optimization 
Describes conditions for local optimization in Hilbert Spaces 


We also must define the notion of an extremum in an arbitrary normed
space.


Definition 1 Let $f$ be a real-valued functional defined on $\Omega \subseteq X$ where $X$
is a normed space. A point $x_0 \in \Omega$ is a local/relative minimum of $f$ on $\Omega$ if
$f(x_0) \le f(x)$ for all $x \in \Omega$ such that $\|x - x_0\| < \epsilon$ for some $\epsilon > 0$.


Definition 2 Let $f$ be a real-valued functional defined on $\Omega \subseteq X$ where $X$
is a normed space. A point $x_0 \in \Omega$ is a local maximum of $f$ on $\Omega$ if
$f(x_0) \ge f(x)$ for all $x \in \Omega$ such that $\|x - x_0\| < \epsilon$ for some $\epsilon > 0$.


Definition 3 Let $f$ be a real-valued functional defined on $\Omega \subseteq X$ where $X$
is a normed space. A point $x_0 \in \Omega$ is a local strict minimum of $f$ on $\Omega$ if
$f(x_0) < f(x)$ for all $x \in \Omega$, $x \ne x_0$, such that $\|x - x_0\| < \epsilon$ for some $\epsilon > 0$.


Definition 4 Let $f$ be a real-valued functional defined on $\Omega \subseteq X$ where $X$
is a normed space. A point $x_0 \in \Omega$ is a local strict maximum of $f$ on $\Omega$ if
$f(x_0) > f(x)$ for all $x \in \Omega$, $x \ne x_0$, such that $\|x - x_0\| < \epsilon$ for some $\epsilon > 0$.


It turns out the notion of a gradient is intrinsically linked to the directional 
derivatives we have introduced. 


Definition 5 Let $X$ be a Hilbert space and $f : X \to \mathbb{R}$. If $f$ is a Fréchet
differentiable functional, then for each $x \in X$ there exists a vector in $X$
such that $\delta f(x; h) = \langle h, \nabla f(x)\rangle$ for all $h \in X$; the vector $\nabla f(x)$ is called
the gradient of $f$ at $x$, and the gradient can be written as a mapping $\nabla f : X \to X$.


This definition can be seen to correspond to an application of the Riesz
representation theorem to the Fréchet derivative $\delta f(x; h)$, which is a linear
bounded functional on $h$.


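Example 1 We know now that:
Equation:

$$\delta f(x; h) = \langle h, \nabla f(x)\rangle.$$

By the Cauchy-Schwarz Inequality, we have:
Equation:

$$|\delta f(x; h)| = |\langle h, \nabla f(x)\rangle| \le \|h\|\,\|\nabla f(x)\|.$$

Equality holds when $h$ is a multiple of $\nabla f(x)$, so among all directions $h$ of a given norm,
$\delta f(x; h)$ is maximized when $h$ points in the direction of the gradient $\nabla f(x)$.


Example 2 Recall that if $f : \mathbb{R}^n \to \mathbb{R}$, then
Equation:

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.$$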


Theorem 1 Let $f : X \to \mathbb{R}$ have a Gateaux differential on $X$. A necessary
condition for $f$ to have an extremum at $x_0 \in X$ is that $\delta f(x_0; h) = 0$ for all
$h \in X$. Alternatively, if $X$ is a Hilbert space, we can write
$\langle h, \nabla f(x_0)\rangle = 0$ for all $h \in X$, which implies $\nabla f(x_0) = 0$.


Suppose $x_0$ is a local minimum. Then there exists $\epsilon > 0$ such that if
$\|x - x_0\| < \epsilon$ then $f(x_0) \le f(x)$. Fix $h \ne 0$ and let $\beta = \frac{\epsilon}{\|h\|}$. Next,
consider $x = x_0 + \alpha h$. For $\alpha \in (-\beta, \beta)$:


• If $\alpha > 0$ then $\frac{f(x_0 + \alpha h) - f(x_0)}{\alpha} \ge 0$, and therefore $\delta f(x_0; h) \ge 0$.
• If $\alpha < 0$ then $\frac{f(x_0 + \alpha h) - f(x_0)}{\alpha} \le 0$, and therefore $\delta f(x_0; h) \le 0$.


Therefore, $\delta f(x_0; h) = 0$ for arbitrary nonzero $h$. For $h = 0$, the definition of the
Gateaux differential directly gives $\delta f(x_0; 0) = 0$. Therefore, the equality is
true for all $h \in X$. The argument for a local maximum is analogous.


Definition 6 A point at which $\delta f(x; h) = 0$ for all $h \in X$ is called a
stationary point of $f$.
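As a finite-dimensional illustration of stationary points, consider the quadratic functional $f(x) = \langle Lx, x\rangle - 2\langle b, x\rangle$ on $\mathbb{R}^n$ with $L$ symmetric positive definite; this functional, the gradient-descent iteration, and the step size are assumptions introduced for the example and are not part of the notes. Its gradient is $\nabla f(x) = 2(Lx - b)$, so the stationary point solves $Lx = b$.

```python
import numpy as np

rng = np.random.default_rng(14)
n = 3
M = rng.standard_normal((n, n))
L = M @ M.T + np.eye(n)                  # symmetric positive definite (assumed)
b = rng.standard_normal(n)

def grad(x):
    # f(x) = <Lx, x> - 2 <b, x>  =>  grad f(x) = (L + L^T) x - 2 b = 2 (L x - b)
    return 2 * (L @ x - b)

# Plain gradient descent with a fixed step size of 1 / (2 ||L||).
x = np.zeros(n)
step = 1.0 / (2 * np.linalg.norm(L, 2))
for _ in range(5000):
    x = x - step * grad(x)

x_star = np.linalg.solve(L, b)           # stationary point: grad f(x_star) = 0
```

The iterate `x` approaches `x_star`, where the gradient (and hence the differential in every direction `h`) vanishes.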


Calculus of Variations 
Introduces calculus of variations problems in optimization and their solutions based on the Euler-Lagrange 
equation. 


The calculus of variations refers to a generic class of optimization problems that can be written in terms of
Equation:

$$\text{minimize } J(x) \quad \text{subject to } x(t_1) = c_1, \ x(t_2) = c_2,$$

where $J : C[t_1, t_2] \to \mathbb{R}$ is a functional that can be written as

Equation:

$$J(x) = \int_{t_1}^{t_2} f(x(t), \dot{x}(t), t)\,dt,$$

where $\dot{x}(t) = \frac{dx(t)}{dt}$. The function $f$ must meet the following conditions:


• $f(x(t), \dot{x}(t), t)$ is continuous on $x(t)$, $\dot{x}(t)$, and $t$ as individual inputs,
• $f(x(t), \dot{x}(t), t)$ has continuous partial derivatives with respect to $x(t)$ and $\dot{x}(t)$, written as

Equation:

$$f_x(x(t), \dot{x}(t), t) = \frac{\partial}{\partial x}f(x(t), \dot{x}(t), t),$$
$$f_{\dot{x}}(x(t), \dot{x}(t), t) = \frac{\partial}{\partial\dot{x}}f(x(t), \dot{x}(t), t).$$

Consider the set of admissible functions $x \in C[t_1, t_2]$. One can pick any particular admissible function $x$ and then
define the rest of the set in terms of $x + h$, where $h(t_1) = h(t_2) = 0$. Therefore, at an optimum $x$, we require the
directional derivative in the feasible directions $h$ to be zero valued, i.e., $\delta J(x; h) = 0$ for all feasible directions $h$.

Now recall that since J(z) is scalar-valued, we have 

Equation: 


OI ash) =I (e+ ah)|aao, 


= am iN lf(e (t) + ah (t),&(t) + ah (t),t) |dtlao 


i. a [f(z () + ah (t), (8) + ah (t),t) Jdtlao. 


i. 


We use the following fact: 


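Fact 1 For any differentiable function $g(x, y, z)$, we have that

Equation:
$$\frac{\partial}{\partial \alpha} g(x + \alpha x_1, y + \alpha y_1, z) = \frac{\partial g}{\partial x}(x + \alpha x_1, y + \alpha y_1, z) \, x_1 + \frac{\partial g}{\partial y}(x + \alpha x_1, y + \alpha y_1, z) \, y_1.$$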
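Therefore,

Equation:
$$\begin{aligned}
\delta J(x; h) &= \left. \int_{t_1}^{t_2} \left[ f_x(x(t) + \alpha h(t), \dot{x}(t) + \alpha \dot{h}(t), t) \, h(t) + f_{\dot{x}}(x(t) + \alpha h(t), \dot{x}(t) + \alpha \dot{h}(t), t) \, \dot{h}(t) \right] dt \right|_{\alpha = 0} \\
&= \left. \left\{ \int_{t_1}^{t_2} f_x(x + \alpha h, \dot{x} + \alpha \dot{h}, t) \, h \, dt + \int_{t_1}^{t_2} f_{\dot{x}}(x + \alpha h, \dot{x} + \alpha \dot{h}, t) \, \dot{h} \, dt \right\} \right|_{\alpha = 0},
\end{aligned}$$

where we drop the dependence of the functions on $t$ for brevity (i.e., $x(t)$ is written $x$). Using integration by parts on the second integral ($u = f_{\dot{x}}(x + \alpha h, \dot{x} + \alpha \dot{h}, t)$, $dv = \dot{h} \, dt$), we get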


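Equation:
$$\delta J(x; h) = \left. \left\{ \int_{t_1}^{t_2} f_x(x + \alpha h, \dot{x} + \alpha \dot{h}, t) \, h \, dt + \Big[ f_{\dot{x}}(x(t) + \alpha h(t), \dot{x}(t) + \alpha \dot{h}(t), t) \, h(t) \Big]_{t = t_1}^{t = t_2} - \int_{t_1}^{t_2} h \, \frac{d}{dt} f_{\dot{x}}(x + \alpha h, \dot{x} + \alpha \dot{h}, t) \, dt \right\} \right|_{\alpha = 0};$$

since $h(t_1) = h(t_2) = 0$, we have that
Equation:
$$\begin{aligned}
\delta J(x; h) &= \left. \left\{ \int_{t_1}^{t_2} \left[ h \, f_x(x + \alpha h, \dot{x} + \alpha \dot{h}, t) - h \, \frac{d}{dt} f_{\dot{x}}(x + \alpha h, \dot{x} + \alpha \dot{h}, t) \right] dt \right\} \right|_{\alpha = 0} \\
&= \int_{t_1}^{t_2} h(t) \left[ f_x(x(t), \dot{x}(t), t) - \frac{d}{dt} f_{\dot{x}}(x(t), \dot{x}(t), t) \right] dt.
\end{aligned}$$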


It follows that for $\delta J(x; h) = 0$ for all feasible directions $h$, we must have
Equation:
$$f_x(x(t), \dot{x}(t), t) - \frac{d}{dt} f_{\dot{x}}(x(t), \dot{x}(t), t) = 0,$$

which provides the following condition on the solution $x(t)$ of the problem [link]:
Equation:
$$f_x(x(t), \dot{x}(t), t) = \frac{d}{dt} f_{\dot{x}}(x(t), \dot{x}(t), t).$$

This condition is known as the Euler-Lagrange equation.


Example 1 We look for a function $x(t) \in C[t_1, t_2]$ that minimizes the length of the path (curve) between the points $(t_1, c_1)$ and $(t_2, c_2)$, as shown in [link].

Example of a calculus of variations problem: finding the curve connecting two points that achieves minimum length, which is computed in a differential fashion.


In this case, we can consider increments of the curve's length $L$ in terms of differences of the input $dt$ and the output $dx$:
Equation:
$$dL = \sqrt{dt^2 + dx^2} = dt \sqrt{1 + \left( \frac{dx}{dt} \right)^2} = dt \sqrt{1 + \dot{x}^2},$$

and by integrating both sides we get that
Equation:
$$L = \int_{t_1}^{t_2} dL = \int_{t_1}^{t_2} \sqrt{1 + \dot{x}^2} \, dt = \int_{t_1}^{t_2} f(x, \dot{x}, t) \, dt,$$

where we have written $f(x, \dot{x}, t) = \sqrt{1 + \dot{x}^2}$, and the problem has set the boundary conditions $x(t_1) = c_1$, $x(t_2) = c_2$. Thus, we have a calculus of variations problem.


To obtain the solution to this problem, we set up the Euler-Lagrange equation:

Equation:
$$\begin{aligned}
f_x(x, \dot{x}, t) &= 0, \\
f_{\dot{x}}(x, \dot{x}, t) &= \frac{1}{2} (1 + \dot{x}^2)^{-1/2} \, 2 \dot{x} = \frac{\dot{x}}{\sqrt{1 + \dot{x}^2}}, \\
\frac{d}{dt} f_{\dot{x}}(x, \dot{x}, t) &= 0;
\end{aligned}$$

in words, $f_{\dot{x}}(x, \dot{x}, t)$ must be constant as a function of $t$, which implies that $\dot{x}(t)$ must be a constant function of $t$; a function with constant first derivative is a straight line. Therefore, the shortest path between the two aforementioned points is the straight line that connects them.
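The conclusion of this example can be checked numerically. The sketch below is only an illustration under assumed values: the endpoints, the grid size, and the perturbation $h(t) = \sin(\pi (t - t_1)/(t_2 - t_1))$ are choices made here, not part of the original example. It approximates $J(x)$ by a sum of segment lengths and compares the straight line against perturbed admissible curves $x + \alpha h$ with $h(t_1) = h(t_2) = 0$.

```python
import numpy as np

def path_length(t, x):
    """Approximate J(x) = integral of sqrt(1 + xdot^2) dt by summing segment lengths."""
    return float(np.sum(np.hypot(np.diff(t), np.diff(x))))

# Assumed endpoints (t1, c1) and (t2, c2), chosen only for illustration.
t1, t2, c1, c2 = 0.0, 1.0, 0.0, 2.0
t = np.linspace(t1, t2, 2001)

# Candidate minimizer: the straight line through the two endpoints.
line = c1 + (c2 - c1) * (t - t1) / (t2 - t1)

# A feasible direction: h(t1) = h(t2) = 0, so line + a*h keeps the boundary values.
h = np.sin(np.pi * (t - t1) / (t2 - t1))

straight = path_length(t, line)
perturbed = [path_length(t, line + a * h) for a in (-0.5, -0.1, 0.1, 0.5)]

print("straight line:", straight)
print("perturbed curves:", perturbed)
```

Every perturbed curve should come out longer than the straight line, whose length equals the Euclidean distance between the endpoints.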


Example 2 Consider a retirement plan with the following constraints:

• Your current capital is $S$ dollars, and ideally by the end of your life you will have spent it all; that is, if $x(t)$ is your capital at time $t$, then $x(0) = S$ and $x(T) = 0$.
• Your expense rate is given by the function $r(t)$, and spending at rate $r(t)$ gives you a quantifiable amount of enjoyment $u[r(t)]$.


The goal of the planning is to maximize your total enjoyment:
Equation:
$$\int_0^T e^{-\beta t} u[r(t)] \, dt,$$

where the exponential weight $e^{-\beta t}$ discounts enjoyment to reflect that it decreases with age. The change in your capital is given by its derivative, which must account for your expense rate and the return on investment:
Equation:
$$\dot{x}(t) = -r(t) + \alpha x(t),$$

where $\alpha > 1$. The problem is thus to maximize the following functional of your capital function:
Equation:
$$J(x) = \int_0^T e^{-\beta t} u[\alpha x(t) - \dot{x}(t)] \, dt = \int_0^T f(x(t), \dot{x}(t), t) \, dt,$$

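which together with the boundary constraints gives us a calculus of variations problem. Thus, once again, we set up the Euler-Lagrange equation: if we denote $u'[r] = \frac{d}{dr} u[r]$, then

Equation:
$$\begin{aligned}
f_x(x, \dot{x}, t) &= \alpha e^{-\beta t} u'[\alpha x(t) - \dot{x}(t)], \\
f_{\dot{x}}(x, \dot{x}, t) &= -e^{-\beta t} u'[\alpha x(t) - \dot{x}(t)], \\
\frac{d}{dt} f_{\dot{x}}(x, \dot{x}, t) &= \beta e^{-\beta t} u'[\alpha x(t) - \dot{x}(t)] - e^{-\beta t} \frac{d}{dt} u'[\alpha x(t) - \dot{x}(t)].
\end{aligned}$$

So we obtain
Equation:
$$\begin{aligned}
\alpha e^{-\beta t} u'[\alpha x(t) - \dot{x}(t)] &= \beta e^{-\beta t} u'[\alpha x(t) - \dot{x}(t)] - e^{-\beta t} \frac{d}{dt} u'[\alpha x(t) - \dot{x}(t)], \\
(\beta - \alpha) \, u'[\alpha x(t) - \dot{x}(t)] &= \frac{d}{dt} u'[\alpha x(t) - \dot{x}(t)].
\end{aligned}$$

Now we can switch back to the rate of expense $r(t) = \alpha x(t) - \dot{x}(t)$ to get
Equation:
$$(\beta - \alpha) \, u'[r(t)] = \frac{d}{dt} u'[r(t)],$$

which is a differential equation. The solution for $u'[r(t)]$ is therefore given by
Equation:
$$u'[r(t)] = u'[r(0)] \, e^{(\beta - \alpha) t}.$$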


To move forward, we need to select a candidate form for the utility function $u$. Our goals for this function are to showcase diminishing marginal enjoyment as one spends increasing amounts of money (i.e., $u'[m] \to 0$ as $m \to \infty$) and a sense of significantly increasing enjoyment when the amount of money spent is small but increasing (i.e., $u'[0] = \infty$). A candidate function that obeys these two conditions is $u[m] = 2 m^{1/2}$, which provides $u'[m] = m^{-1/2}$; replacing in [link], we get that the rate of expense must obey

Equation:
$$r(t)^{-1/2} = r(0)^{-1/2} e^{(\beta - \alpha) t}.$$


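Connecting back to the capital function, we get that
Equation:
$$\alpha x(t) - \dot{x}(t) = r(t) = r(0) \, e^{2 (\alpha - \beta) t},$$

which is another differential equation. The solution to this equation is
Equation:
$$x(t) = e^{\alpha t} x(0) + \frac{r(0)}{\alpha - 2\beta} \left( e^{\alpha t} - e^{2 (\alpha - \beta) t} \right).$$

In this equation we can replace $x(0) = S$; assuming $\alpha > \beta > \alpha / 2$, we can find $r(0)$ by setting $t = T$ above; since $x(T) = 0$, we get
Equation:
$$r(0) = \frac{(2\beta - \alpha) \, x(0)}{1 - e^{(\alpha - 2\beta) T}} = \frac{(2\beta - \alpha) \, S}{1 - e^{(\alpha - 2\beta) T}}.$$

Therefore, the final solution to the problem is
Equation:
$$\begin{aligned}
x(t) &= e^{\alpha t} S - \frac{S}{1 - e^{(\alpha - 2\beta) T}} \left( e^{\alpha t} - e^{2 (\alpha - \beta) t} \right) \\
&= S e^{\alpha t} \, \frac{e^{(\alpha - 2\beta) t} - e^{(\alpha - 2\beta) T}}{1 - e^{(\alpha - 2\beta) T}}.
\end{aligned}$$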
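As a sanity check on the closed-form solution, the following sketch evaluates the capital and expense-rate functions numerically and verifies the boundary conditions $x(0) = S$, $x(T) = 0$ and the dynamics $\dot{x}(t) = \alpha x(t) - r(t)$. The parameter values are assumptions chosen here only so that $\alpha > \beta > \alpha / 2$ holds; they do not come from the example.

```python
import numpy as np

# Assumed parameter values (illustrative only), with alpha > beta > alpha / 2.
alpha, beta = 1.2, 0.9
S, T = 100.0, 10.0

def expense_rate(t):
    """r(t) = r(0) * exp(2 (alpha - beta) t), with r(0) fixed by x(T) = 0."""
    r0 = (2 * beta - alpha) * S / (1.0 - np.exp((alpha - 2 * beta) * T))
    return r0 * np.exp(2 * (alpha - beta) * t)

def capital(t):
    """Closed-form capital x(t) from the final equation of the example."""
    e_t = np.exp((alpha - 2 * beta) * t)
    e_T = np.exp((alpha - 2 * beta) * T)
    return S * np.exp(alpha * t) * (e_t - e_T) / (1.0 - e_T)

t = np.linspace(0.0, T, 100001)
x = capital(t)

# Finite-difference derivative of x(t), compared against alpha * x - r.
x_dot = np.gradient(x, t)
residual = x_dot - (alpha * x - expense_rate(t))
max_interior_residual = np.max(np.abs(residual[1:-1]))

print("x(0) =", capital(0.0), " x(T) =", capital(T))
print("max |xdot - (alpha x - r)| on interior grid points:", max_interior_residual)
```

The residual is limited by the finite-difference approximation, so it is small rather than exactly zero.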


Constrained Optimization 
Introduces the theory of constrained optimization, including Lagrange multipliers. 


In constrained optimization, we look to minimize or maximize an objective function only over a set of inputs $\Omega \subseteq X$ that can be written in the following form:

Equation:
$$\Omega = \left\{ x \in X : \begin{array}{ll} g_1(x) = 0, & h_1(x) \le 0, \\ g_2(x) = 0, & h_2(x) \le 0, \\ g_3(x) = 0, & h_3(x) \le 0, \\ \quad \vdots & \quad \vdots \end{array} \right\}.$$

In words, $\Omega$ is a set described by a set of equalities and inequalities on $x$. The constraints $g_i(x) = 0$ are said to be equality constraints, and the constraints $h_i(x) \le 0$ are said to be inequality constraints.


Example 1 Minimize $f(x)$ subject to $x_3 = 0$, $1 \le x_1 \le 2$ and $1 \le x_2 \le 2$, drawn in [link].

An example of a feasible set $\Omega$ that can be expressed using equality and inequality constraints.

In this optimization problem, the feasible set $\Omega$ can be written in terms of one equality constraint, $g(x) = x_3$, and four inequality constraints:
Equation:
$$\begin{aligned}
h_1(x) &= 1 - x_1, \\
h_2(x) &= x_1 - 2, \\
h_3(x) &= 1 - x_2, \\
h_4(x) &= x_2 - 2.
\end{aligned}$$


Definition 1 The points $x \in \Omega$ are called feasible points, and the set $\Omega$ is called the feasible set.

In the sequel, we will assume that $f$, $g_i$, $h_i$ are continuous and Fréchet-differentiable (continuous gradients).

We will first consider problems where the feasible set $\Omega$ can be expressed in terms of equality constraints only.


Tangent space 


For a given feasible point $x_0 \in \Omega$, the tangent space gives us the set of directions in which one can move from $x_0$ while still staying within the feasible set $\Omega$. The two examples below show tangent spaces in the cases where the set $\Omega$ corresponds to a curve in $\mathbb{R}^2$ and a nonlinear manifold in $\mathbb{R}^3$.

Examples of tangent spaces and gradients.


The tangent space at $x_0$ can be expressed formally in terms of the derivatives of the equality constraints.

Definition 2 The tangent space to the feasible set $\Omega$ with equality constraints $g_1, \ldots, g_n$ at a feasible point $x_0 \in \Omega$ is given by
Equation:
$$\begin{aligned}
T_\Omega(x_0) &= \{ d \in X : \delta g_i(x_0; d - x_0) = 0, \; i = 1, 2, \ldots, n \}, \\
&= \{ d \in X : \langle \nabla g_i(x_0), d - x_0 \rangle = 0, \; i = 1, 2, \ldots, n \}.
\end{aligned}$$

Requiring all the directional derivatives of the equality constraints to be zero in the direction $d - x_0$ keeps the value of the equality constraints at zero to first order, so that moving from $x_0$ toward $d$ stays in the feasible set $\Omega$ to first order.


Definition 3 A point $x_0$ satisfying the constraints $g_1(x_0) = 0, g_2(x_0) = 0, \ldots, g_n(x_0) = 0$ is said to be a regular point if the $n$ linear functionals $\delta g_1(x_0; h), \delta g_2(x_0; h), \ldots, \delta g_n(x_0; h)$ (i.e., the derivatives of the equality constraints) are linearly independent as functionals of $h$.

Theorem 1 If $x_0$ is an extremum of $f(x)$ subject to the constraints $g_1(x_0) = 0, g_2(x_0) = 0, \ldots, g_n(x_0) = 0$ and $x_0$ is a regular point of $\{ g_i \}_{i=1}^n$, then for any $h \in X$ such that $\delta g_i(x_0; h) = 0$ for all $i = 1, 2, \ldots, n$, we must have $\delta f(x_0; h) = 0$.


One can rewrite this theorem in terms of gradients as: if $\langle \nabla g_i(x_0), h \rangle = 0$ for all $i = 1, \ldots, n$, then $\langle \nabla f(x_0), h \rangle = 0$. More intuition can be obtained by defining the translated tangent space at $x_0$ as follows:

Equation:
$$\widetilde{T}_\Omega(x_0) = \{ d \in X : \langle \nabla g_i(x_0), d \rangle = 0, \; i = 1, 2, \ldots, n \}.$$

Thus, the theorem above can be written as: if $h \in \widetilde{T}_\Omega(x_0)$, then $\langle \nabla f(x_0), h \rangle = 0$. In words, we expect the derivative of the objective function $f$ at the constrained extremum $x_0$ to be zero-valued in the directions in which we can move from $x_0$ and remain in the feasible set $\Omega$, that is, in the directions $h \in \widetilde{T}_\Omega(x_0)$. Thus, we can write that the constrained optimum $x_0$ must obey $\nabla f(x_0) \perp \widetilde{T}_\Omega(x_0)$, which implies $\nabla f(x_0) \perp T_\Omega(x_0)$.


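Example 2 Let $X = \mathbb{R}^2$; solve
Equation:
$$x_0 = \arg \max_{x = [x_1 \; x_2]^T} f(x) = x_1 + x_2 \quad \text{subject to } x_1^2 + x_2^2 = 1.$$

In words, we look for the point on the unit circle in $\mathbb{R}^2$ that has the largest sum of its coordinates. It is easy to see by inspection that this point is given by $x_0 = \left[ \frac{\sqrt{2}}{2} \; \frac{\sqrt{2}}{2} \right]^T$. The feasible set can be given in terms of the single equality constraint
Equation:
$$g(x) = x_1^2 + x_2^2 - 1 = x^T x - 1 = \langle x, x \rangle - 1 = \langle x, I x \rangle - 1.$$

From previous work, we know that the gradient is given by $\nabla g(x) = (I + I^T) x = 2 I x = 2x$. We can also write the objective function as
Equation:
$$f(x) = x_1 + x_2 = \left\langle x, \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right\rangle,$$
which has gradient $\nabla f(x) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$. Thus, we can write the translated tangent space in this case as

Equation:
$$\widetilde{T}_\Omega(x_0) = \{ h \in X : \langle \nabla g(x_0), h \rangle = 0 \} = \{ h \in X : \langle x_0, h \rangle = 0 \} = \left\{ h : \tfrac{\sqrt{2}}{2} h_1 + \tfrac{\sqrt{2}}{2} h_2 = 0 \right\}.$$

Therefore, we can write this space as $\widetilde{T}_\Omega(x_0) = \{ h \in X : h_1 = -h_2 \}$. It is easy to see at this point that for any such $h \in \widetilde{T}_\Omega(x_0)$ we will have $\langle \nabla f(x_0), h \rangle = h_1 + h_2 = 0$, as stated by Theorem [link].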
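The example can be reproduced numerically with a brute-force search over the feasible set. The sketch below is an illustration only: the grid resolution and the variable names are choices made here. It samples the unit circle, picks the feasible point with the largest value of $f$, and checks that a direction with $h_1 = -h_2$ is orthogonal to both $\nabla g(x_0)$ and $\nabla f(x_0)$.

```python
import numpy as np

# Sample the feasible set g(x) = x1^2 + x2^2 - 1 = 0 (the unit circle).
theta = np.linspace(0.0, 2.0 * np.pi, 200001)
circle = np.stack([np.cos(theta), np.sin(theta)])  # shape (2, N)

# Objective f(x) = x1 + x2 evaluated at every sampled feasible point.
values = circle.sum(axis=0)
x0 = circle[:, np.argmax(values)]

grad_f = np.array([1.0, 1.0])   # gradient of f(x) = x1 + x2
grad_g = 2.0 * x0               # gradient of g(x) = <x, x> - 1 at x0
h = np.array([1.0, -1.0])       # spans the directions with h1 = -h2

print("x0 =", x0)
print("<grad g(x0), h> =", grad_g @ h)
print("<grad f(x0), h> =", grad_f @ h)
```

Both inner products should vanish (up to grid and rounding error), in agreement with the theorem.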


Lagrangian Multipliers 


In this section, we will develop a method to solve optimization problems with linear objective 
functions and linear equality constraints. 


Lemma 1 Let $f_0, f_1, f_2, \ldots, f_n$ be linear functionals on a Hilbert space $X$ and suppose $f_0(x) = 0$ for all $x \in X$ such that $f_i(x) = 0$, $i = 1, 2, \ldots, n$. Then there exist constants $\lambda_1, \lambda_2, \ldots, \lambda_n$ such that $f_0 = \lambda_1 f_1 + \lambda_2 f_2 + \ldots + \lambda_n f_n$.


Since our functionals are linear we have $f_0, f_1, \ldots, f_n \in X^*$. Define the subspace $M = \mathrm{span}(\{ f_1, f_2, \ldots, f_n \})$. Since $M$ is finite-dimensional, $M$ is closed. We can therefore define its orthogonal complement:

Equation:
$$M^\perp = \{ f \in X^* : \langle f, f_i \rangle = 0, \; i = 1, 2, \ldots, n \}.$$

Since Hilbert spaces are self-dual, for each functional $f_i$ there exists $w_i \in X$ such that $f_i(x) = \langle x, w_i \rangle$; therefore, we can rewrite the space above as its dual equivalent
Equation:
$$M^\perp = \{ w \in X : \langle w, w_i \rangle = 0, \; i = 1, 2, \ldots, n \}.$$

Now since $\langle w, w_i \rangle = f_i(w)$, it follows that for all $w \in M^\perp$ we have that $f_i(w) = 0$, $i = 1, \ldots, n$. Therefore, the hypothesis of the Lemma implies that for all $w \in M^\perp$ we have that $f_0(w) = 0 = \langle w, w_0 \rangle$. This implies that $w_0 \in (M^\perp)^\perp = M$, due to $M$ being closed. Reversing to the dual space $X^*$, this implies that $f_0 \in M = \mathrm{span}(\{ f_1, \ldots, f_n \})$, and so we can write $f_0 = \lambda_1 f_1 + \lambda_2 f_2 + \ldots + \lambda_n f_n$.


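Theorem [link] shows that we can apply Lemma [link] to the constrained optimization problem. Thus, at the extremum $x_0 \in X$ of the constrained program there exist constants $\lambda_1, \ldots, \lambda_n$ such that for all $h \in X$,
Equation:
$$\begin{aligned}
\delta f(x_0; h) + \sum_{i=1}^n \lambda_i \, \delta g_i(x_0; h) &= 0, \\
\langle \nabla f(x_0), h \rangle + \sum_{i=1}^n \lambda_i \langle \nabla g_i(x_0), h \rangle &= 0, \\
\left\langle \nabla f(x_0) + \sum_{i=1}^n \lambda_i \nabla g_i(x_0), h \right\rangle &= 0,
\end{aligned}$$

which is equivalent to $\nabla f(x_0) + \sum_{i=1}^n \lambda_i \nabla g_i(x_0) = 0$. This can be written as the gradient of the Lagrangian function
Equation:
$$L(x, \lambda) = f(x) + \sum_{i=1}^n \lambda_i g_i(x).$$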


Thus, we say that the extremum must provide a zero-valued directional derivative of the 
Lagrangian in all directions or, equivalently, a zero-valued gradient for the Lagrangian. The results 
obtained in this section can be collected into a proof for the following theorem. 


Theorem 2 If $x_0 \in X$ is an extremum of a functional $f$ subject to constraints $\{ g_i \}_{i=1}^n$, then there exist scalars $\lambda_1, \ldots, \lambda_n$ such that the Lagrangian $L(x, \lambda) = f(x) + \sum_{i=1}^n \lambda_i g_i(x)$ is stationary at $x_0$, i.e., for all $h \in X$ we have that $\delta L(x_0, \lambda; h) = 0$, i.e., $\nabla L(x_0, \lambda) = 0$.

The constants $\lambda_1, \lambda_2, \ldots, \lambda_n$ are known as Lagrange multipliers.


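Example 3 We want to minimize $f(x) = x_1^2 + x_2^2$ subject to $2 x_1 + x_2 = 3$. The constraint function is
Equation:
$$g(x) = 2 x_1 + x_2 - 3.$$

From earlier we know that $\nabla f(x) = 2x$, while we can rewrite the constraint as
Equation:
$$g(x) = \left\langle x, \begin{bmatrix} 2 \\ 1 \end{bmatrix} \right\rangle - 3,$$
so that $\nabla g(x) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$. Therefore, the extremum's condition on the gradient of the Lagrangian results in the equation

Equation:
$$2x + \lambda \begin{bmatrix} 2 \\ 1 \end{bmatrix} = 0, \qquad \begin{bmatrix} 2 x_1 + 2 \lambda \\ 2 x_2 + \lambda \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix};$$
the solution to this equation is $x_1 = -\lambda$, $x_2 = -\frac{1}{2} \lambda$. To solve for the value of the Lagrange multiplier $\lambda$, we plug the solution into the constraint:
Equation:
$$2(-\lambda) + \left( -\frac{\lambda}{2} \right) - 3 = 0,$$

which gives $\lambda = -\frac{6}{5}$. Therefore, we end up with the solution $x_1 = \frac{6}{5}$, $x_2 = \frac{3}{5}$.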
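Because the objective of this example is quadratic and the constraint is linear, the stationarity condition of the Lagrangian together with the constraint forms a linear system in $(x_1, x_2, \lambda)$. The sketch below simply solves that system with NumPy; the matrix layout and variable names are choices made here for illustration.

```python
import numpy as np

# Unknowns ordered as (x1, x2, lam).
# Rows 1-2: gradient of the Lagrangian, 2*x + lam * [2, 1]^T = 0.
# Row 3:    the constraint 2*x1 + x2 = 3.
K = np.array([
    [2.0, 0.0, 2.0],
    [0.0, 2.0, 1.0],
    [2.0, 1.0, 0.0],
])
rhs = np.array([0.0, 0.0, 3.0])

x1, x2, lam = np.linalg.solve(K, rhs)
print("x1 =", x1, " x2 =", x2, " lambda =", lam)
```

The result should agree with the values derived by hand above.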


Second Order Conditions 
Describes second order conditions in local optimization to find maxima and 
minima. 


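When the objective function and constraints are defined on $\mathbb{R}^n$, i.e., $f : \mathbb{R}^n \to \mathbb{R}$, $g_i : \mathbb{R}^n \to \mathbb{R}$, it is easy to check whether an extremum $x_0$ is a maximum or a minimum of the functional. We appeal to second-order derivatives, collected in matrices known as Hessians.

Definition 1 The $n \times n$ Hessian matrix $F(x)$ for the functional $f(x) : \mathbb{R}^n \to \mathbb{R}$ has entries
Equation:
$$F(x)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \quad i, j = 1, 2, \ldots, n.$$

Lemma 1 Let $\mathbf{L}(x)$ be the Hessian of the Lagrangian $L(x, \lambda)$ and let $x_0$ be an extremum. If $d^T \mathbf{L}(x_0) d > 0$ for all nonzero $d$ in the translated tangent space $\widetilde{T}_\Omega(x_0)$, then $x_0$ is a minimizer. If $d^T \mathbf{L}(x_0) d < 0$ for all nonzero $d \in \widetilde{T}_\Omega(x_0)$, then $x_0$ is a maximizer.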


Example 1 Find the extremum of $f(x) = x_2^2 + x_2 x_3 + x_1 x_3$ subject to $x_1 + x_2 + x_3 = 3$ and determine whether it is a maximum or a minimum.

To begin, we write the optimization's equality constraint:
Equation:
$$g(x) = x_1 + x_2 + x_3 - 3.$$

The objective function can be written in the form
Equation:
$$f(x) = \sum_{i,j} a_{ij} x_i x_j = x^T A x,$$

where the matrix $A$ has entries given by $A_{ij} = a_{ij}$. Thus, for our example the resulting matrix is

Equation:
$$A = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix}.$$

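The gradient for this function is given by
Equation:
$$\nabla f(x) = (A + A^T) x = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 2 & 1 \\ 1 & 1 & 0 \end{bmatrix} x = B x,$$

where $B = A + A^T$. We can also rewrite the equality constraint as $g(x) = \mathbf{1}^T x - 3$, where $\mathbf{1}$ denotes a vector of appropriate size with all entries equal to one. Therefore, its gradient is equal to $\nabla g(x) = \mathbf{1}$. The resulting gradient of the Lagrangian is set to zero to obtain the solution:
Equation:
$$\nabla f(x) + \lambda \nabla g(x) = 0$$
Equation:
$$B x + \lambda \mathbf{1} = 0$$
Equation:
$$B x = -\lambda \mathbf{1}$$
Equation:
$$x = -\lambda B^{-1} \mathbf{1} = \begin{bmatrix} -\lambda \\ 0 \\ -\lambda \end{bmatrix}$$

Solve for $\lambda$ from $\mathbf{1}^T x = 3$ to obtain $\lambda = -3/2$. Therefore, the optimization's solution is
Equation:
$$x_0 = \begin{bmatrix} 3/2 \\ 0 \\ 3/2 \end{bmatrix}.$$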


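We can solve for the Hessians $F(x)$ and $G(x)$ of $f(x)$ and $g(x)$:
Equation:
$$F(x) = B, \qquad G(x) = 0.$$

We therefore obtain that the Hessian of the Lagrangian is equal to
Equation:
$$\mathbf{L}(x_0) = F(x_0) + \lambda G(x_0) = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 2 & 1 \\ 1 & 1 & 0 \end{bmatrix}.$$

At this point we need to check if the product $h^T \mathbf{L}(x_0) h$ is positive or negative for all $h \in \widetilde{T}_\Omega(x_0)$, the translated tangent space defined as
Equation:
$$\widetilde{T}_\Omega(x_0) = \{ h : \langle \nabla g(x_0), h \rangle = 0 \} = \{ h : \langle \mathbf{1}, h \rangle = 0 \}.$$

It is easy to see that $h \in \widetilde{T}_\Omega(x_0)$ if and only if $h_1 + h_2 + h_3 = 0$. To begin, we check whether the eigenvalues of $\mathbf{L}(x_0)$ are all positive or all negative: a calculation returns $\{ -1.1701, 0.6889, 2.4812 \}$. Since neither case occurred, we have to specifically consider the case in which $h_1 + h_2 + h_3 = 0$:
Equation:
$$\begin{aligned}
h^T \mathbf{L}(x_0) h &\gtrless 0, \\
\sum_{i,j=1}^3 B_{ij} h_i h_j &\gtrless 0, \\
2 h_2^2 + 2 h_2 h_3 + 2 h_1 h_3 &\gtrless 0, \\
h_2^2 + h_3 (h_1 + h_2) &\gtrless 0, \\
h_2^2 - h_3^2 &\gtrless 0,
\end{aligned}$$

where the last step uses $h_1 + h_2 = -h_3$. It turns out that we can find $h \in \widetilde{T}_\Omega(x_0)$ for which the value on the left hand side is positive or negative; for example, $h = [-1 \; 1 \; 0]^T$ gives $h_2^2 - h_3^2 = 1$ and $h = [-1 \; 0 \; 1]^T$ gives $h_2^2 - h_3^2 = -1$. Therefore, this is neither a maximum nor a minimum, and we have found an inflection point.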
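The steps of this example can be sketched numerically. The code below is an illustration with names chosen here: it solves the stationarity and constraint equations as one linear system, computes the eigenvalues of the Hessian of the Lagrangian, and then restricts that Hessian to the directions with $h_1 + h_2 + h_3 = 0$ using an orthonormal basis obtained from an SVD (one of several possible ways to build such a basis).

```python
import numpy as np

# Hessian of f(x) = x2^2 + x2*x3 + x1*x3, i.e. B = A + A^T from the example.
B = np.array([[0.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
ones = np.ones(3)

# Stationarity B x + lam * 1 = 0 together with the constraint 1^T x = 3.
K = np.zeros((4, 4))
K[:3, :3] = B
K[:3, 3] = ones
K[3, :3] = ones
solution = np.linalg.solve(K, np.array([0.0, 0.0, 0.0, 3.0]))
x0, lam = solution[:3], solution[3]

# Eigenvalues of the full Hessian of the Lagrangian (G = 0, so it equals B).
eigs = np.linalg.eigvalsh(B)

# Orthonormal basis Z for {h : 1^T h = 0}: the last two right singular vectors.
_, _, Vt = np.linalg.svd(ones[None, :])
Z = Vt[1:].T
reduced_eigs = np.linalg.eigvalsh(Z.T @ B @ Z)

print("x0 =", x0, " lambda =", lam)
print("eigenvalues of the Hessian:", eigs)
print("eigenvalues restricted to the tangent directions:", reduced_eigs)
```

A restricted Hessian with one negative and one positive eigenvalue confirms that the quadratic form takes both signs on the tangent directions.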


Constrained Optimization with Inequality Constraints 
Introduces constrained optimization with inequality constraints and the Karush Kuhn 
Tucker conditions. 


A constrained optimization problem with inequality constraints can be written as
Equation:
$$\begin{aligned}
\min f(x) \quad \text{subject to } & g_i(x) = 0, \; i = 1, \ldots, n, \\
& h_j(x) \le 0, \; j = 1, \ldots, m.
\end{aligned}$$

Definition 1 Let $x_0$ be a feasible point. If $h_j(x_0) = 0$, we say that the constraint $h_j$ is active at $x_0$; if $h_j(x_0) < 0$ then $h_j$ is inactive at $x_0$. The set of active constraints at $x_0$ is denoted by

Equation:
$$J(x_0) = \{ j : h_j(x_0) = 0 \}.$$

Definition 2 We say $x \in \Omega$ is a regular point in $\Omega$ if $\{ \nabla g_i(x), \; i = 1, \ldots, n \} \cup \{ \nabla h_j(x), \; j \in J(x) \}$ is a linearly independent set.


Theorem 1 (Karush-Kuhn-Tucker Conditions) If $x_0$ is a regular point of $\Omega$ and $x_0$ is a local minimizer of $f$ over $\Omega$, then there exist scalars $\lambda_1, \ldots, \lambda_n$ and $\mu_1, \ldots, \mu_m$ such that:

1. $\mu_j \ge 0$, $j = 1, \ldots, m$,
2. $\sum_{j=1}^m \mu_j h_j(x_0) = 0$ (complementary slackness condition),
3. $\nabla f(x_0) + \sum_{i=1}^n \lambda_i \nabla g_i(x_0) + \sum_{j=1}^m \mu_j \nabla h_j(x_0) = 0$.

Since $h_j(x_0) \le 0$ and $\mu_j \ge 0$, every term of the sum in the second condition is nonpositive, so each term must be zero; thus, if $h_j$ is inactive at $x_0$ (i.e., if $h_j(x_0) < 0$) then we must have $\mu_j = 0$. Therefore, some versions of the theorem feature only the active inequality constraints in the third condition.


Example 1 We pictorially demonstrate some examples of active inequality constraints: consider the case where the set $\Omega$ is a convex set bounded by three inequality constraints, $h_1(x)$, $h_2(x)$, and $h_3(x)$. Now, consider these three possibilities for the minimizer $x_0$ of $f(x)$ in $\Omega$, illustrated in [link]:

Three examples of the application of the Karush-Kuhn-Tucker conditions.

• (a) In this case, the minimizer is in the interior of the set $\Omega$, and no constraints are active; therefore, the minimizer of the constrained problem matches the minimizer of the unconstrained problem, and the solution is found by solving $\nabla f(x_0) = 0$, which ignores all inequality constraints. This means that each $\mu_j = 0$.

• (b) In this case, the minimizer has one active constraint ($h_1(x)$). Consider the gradient of the constraint $\nabla h_1(x_0)$, which is orthogonal to the tangent space. If $\nabla h_1(x_0)$ and $\nabla f(x_0)$ are not collinear (scalar multiples of one another), then there is a direction within the tangent space in which the value of $f(x)$ would decrease and $x_0$ would not be a minimizer. Otherwise, if $\nabla f(x_0) + \mu_1 \nabla h_1(x_0) = 0$ for $\mu_1 < 0$, then both gradients point in the same direction and the value of $f(x)$ would decrease by moving in the opposite direction toward the interior of $\Omega$, and so $x_0$ would not be a minimizer. Therefore, we must have $\nabla f(x_0) + \mu_1 \nabla h_1(x_0) = 0$ with $\mu_1 \ge 0$, as specified by the KKT conditions. In this case the inactive constraints $h_2(x)$ and $h_3(x)$ are ignored.


• (c) In this case, the minimizer has two active constraints ($h_1(x)$ and $h_2(x)$). In this case, the directions $d$ in which we can move from $x_0$ within $\Omega$ must obey $\langle d, \nabla h_1(x_0) \rangle \le 0$ and $\langle d, \nabla h_2(x_0) \rangle \le 0$. Similarly, moving in a direction $d$ decreases the value of $f(x_0 + d)$ below $f(x_0)$ if $\langle d, \nabla f(x_0) \rangle < 0$. Thus, for us to be able to find such a direction $d$ we must have that $\nabla f(x_0) = a \nabla h_1(x_0) + b \nabla h_2(x_0)$ with $a > 0$ and $b > 0$, which would give
Equation:
$$\langle d, \nabla f(x_0) \rangle = a \langle d, \nabla h_1(x_0) \rangle + b \langle d, \nabla h_2(x_0) \rangle \le 0.$$

Thus, the minimizer $x_0$ of $f(x)$ must obey
Equation:
$$\nabla f(x_0) + \mu_1 \nabla h_1(x_0) + \mu_2 \nabla h_2(x_0) = 0,$$
with $\mu_1 \ge 0$ and $\mu_2 \ge 0$.


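The example illustrates a simple “recipe” for solving optimization problems with inequality constraints, which includes the following steps:

1. Pick a candidate active set $J(x_0)$,
2. Build the corresponding form of the third KKT condition:
Equation:
$$\nabla f(x_0) + \sum_{i=1}^n \lambda_i \nabla g_i(x_0) + \sum_{j \in J(x_0)} \mu_j \nabla h_j(x_0) = 0,$$
and solve for $x_0$, $\lambda$, and $\mu$,
3. If $\mu_j \ge 0$ and $h_j(x_0) = 0$ for $j \in J(x_0)$, and $x_0$ also satisfies the remaining (inactive) constraints, then $x_0$ is a solution. Otherwise, pick a new candidate active set $J(x_0)$ and repeat Step 2.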
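The recipe above can be sketched in code for a small problem in which Step 2 reduces to a linear system. The problem below is an assumption introduced here for illustration, not one from the text: minimize $f(x) = \frac{1}{2} \| x - p \|^2$ subject to linear inequality constraints $h_j(x) = a_j^T x - b_j \le 0$ and no equality constraints. The function name `solve_by_active_sets` and the data are hypothetical.

```python
import itertools
import numpy as np

def solve_by_active_sets(p, A, b, tol=1e-9):
    """Enumerate candidate active sets for: min 0.5*||x - p||^2  s.t.  A x <= b.

    For a candidate set J, the third KKT condition (x - p) + A_J^T mu = 0 and
    the active constraints A_J x = b_J form a linear system in (x, mu).
    """
    m, n = A.shape
    for size in range(m + 1):
        for J in itertools.combinations(range(m), size):
            idx = np.array(J, dtype=int)
            AJ = A[idx]
            K = np.zeros((n + size, n + size))
            K[:n, :n] = np.eye(n)
            K[:n, n:] = AJ.T
            K[n:, :n] = AJ
            rhs = np.concatenate([p, b[idx]])
            try:
                sol = np.linalg.solve(K, rhs)
            except np.linalg.LinAlgError:
                continue  # active gradients not independent (not a regular point)
            x, mu = sol[:n], sol[n:]
            # Step 3: nonnegative multipliers and feasibility of all constraints.
            if np.all(mu >= -tol) and np.all(A @ x <= b + tol):
                return x, dict(zip(J, mu.tolist()))
    return None

# Hypothetical data: x1 <= 1, x2 <= 1, x1 + x2 <= 1.5, and p outside the set.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 1.5])
p = np.array([2.0, 2.0])

x_opt, multipliers = solve_by_active_sets(p, A, b)
print("minimizer:", x_opt, " active multipliers:", multipliers)
```

Exhaustive enumeration grows exponentially with the number of constraints, so this approach only makes sense for very small problems; it is meant to mirror the recipe, not to replace a proper solver.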


With some additional assumptions, it can be shown that the KKT conditions can find a 
global minimizer. 


Definition 3 A function $f$ is said to be affine over $\Omega$ if $f\left( \sum_{i=1}^n a_i x_i \right) = \sum_{i=1}^n a_i f(x_i)$ for all $x_1, \ldots, x_n \in \Omega$ and all weights $\{ a_i \}$ obeying $\sum_{i=1}^n a_i = 1$.

Theorem 2 (Karush-Kuhn-Tucker Sufficient Conditions) If $f$ and $h_j$, $j = 1, \ldots, m$ are convex functions and $g_i$, $i = 1, \ldots, n$ are affine functions, and if the KKT conditions are satisfied at a feasible point $x_0 \in \Omega$, then $x_0$ is a global minimizer of $f$ over $\Omega$.


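Fix $x_1 \in \Omega$ and let $d = x_1 - x_0$. Define the function $x(t) = t x_1 + (1 - t) x_0 = x_0 + t d$ over $t \in [0, 1]$. Then, define the constraints restricted to the set of points $x(t)$:
Equation:
$$\begin{aligned}
G_i(t) &= g_i(x(t)) = g_i(t x_1 + (1 - t) x_0) = t \, g_i(x_1) + (1 - t) \, g_i(x_0) = 0, \\
H_j(t) &= h_j(x(t)) = h_j(t x_1 + (1 - t) x_0) \le t \, h_j(x_1) + (1 - t) \, h_j(x_0) \le 0.
\end{aligned}$$

Therefore, all points $x(t)$ are feasible, i.e., $x(t) \in \Omega$. Furthermore, note that $H_j(0) = h_j(x_0) = 0 \ge H_j(t) = h_j(x(t))$ if $j \in J(x_0)$. Now, we compute the derivatives of these two functions with respect to $t$ at $t = 0$:

Equation:
$$\left. \frac{\partial G_i}{\partial t} \right|_{t=0} = 0 = \delta g_i(x_0; d) = \langle \nabla g_i(x_0), d \rangle,$$
and for $j \in J(x_0)$,
Equation:
$$0 \ge \left. \frac{\partial H_j}{\partial t} \right|_{t=0} = \delta h_j(x_0; d) = \langle \nabla h_j(x_0), d \rangle.$$

Now consider the function $F(t) = f(x(t))$: its derivative at $t = 0$ is given by
Equation:
$$\left. \frac{\partial F}{\partial t} \right|_{t=0} = \delta f(x_0; d) = \langle \nabla f(x_0), d \rangle = -\sum_{i=1}^n \lambda_i \langle \nabla g_i(x_0), d \rangle - \sum_{j=1}^m \mu_j \langle \nabla h_j(x_0), d \rangle \ge 0,$$

where we use the third KKT condition, together with $\mu_j \ge 0$ for active constraints and $\mu_j = 0$ for inactive ones. Since $f(x)$ is convex and $x(t)$ is affine, $F(t) = f(x(t))$ is convex in $t \in [0, 1]$. Thus $\frac{\partial F}{\partial t}$ is nondecreasing and $\frac{\partial F(t)}{\partial t} \ge \frac{\partial F(0)}{\partial t} \ge 0$ for $t \in [0, 1]$. Thus, $F(1) \ge F(0)$, or $f(x_1) \ge f(x_0)$. Since $x_1$ was arbitrary, $x_0$ is a global minimum of $f$ on $\Omega$.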


Example 2 (Channel Capacity) The Shannon capacity of an additive white Gaussian noise channel is given by $C = \frac{1}{2} \log_2 \left( 1 + \frac{P}{N} \right)$, where $P$ is the transmitted signal power and $N$ is the noise variance. Assume that $n$ channels are available with a total transmission power $P_T = \sum_{i=1}^n P_i$ available among the channels, where $P_i$ denotes the power in the $i^{th}$ channel. We wish to assign a power profile $P = [P_1, \ldots, P_n]^T$ that maximizes the total capacity for the set of channels

Equation:
$$C(P) = \sum_{i=1}^n C(P_i) = \sum_{i=1}^n \frac{1}{2} \log_2 \left( 1 + \frac{P_i}{N_i} \right),$$
where $N_i$ represents the variance of the noise in the $i^{th}$ channel.


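To solve the problem, we set up an objective function to be minimized
Equation:
$$f(P) = -C(P) = -\sum_{i=1}^n C(P_i) = -\sum_{i=1}^n \frac{1}{2} \log_2 \left( 1 + \frac{P_i}{N_i} \right)$$

and also set up the constraints
Equation:
$$\begin{aligned}
g(P) &= \sum_{i=1}^n P_i - P_T = \mathbf{1}^T P - P_T, \\
h_i(P) &= -P_i = -e_i^T P, \quad i = 1, \ldots, n,
\end{aligned}$$

as the values of the powers must be nonnegative. We start by computing the gradients of these functions: for $f$, we must compute the directional derivative
Equation:
$$\begin{aligned}
\delta f(P; h) &= \left. \frac{\partial}{\partial \alpha} f(P + \alpha h) \right|_{\alpha = 0} = \left. -\sum_{i=1}^n \frac{h_i / N_i}{2 (\ln 2) \left( 1 + \frac{P_i}{N_i} + \alpha \frac{h_i}{N_i} \right)} \right|_{\alpha = 0} \\
&= -\sum_{i=1}^n \frac{h_i}{2 N_i (\ln 2) \left( 1 + \frac{P_i}{N_i} \right)} = -\sum_{i=1}^n \frac{h_i}{2 (\ln 2) (N_i + P_i)} = \langle \nabla f(P), h \rangle,
\end{aligned}$$

where the gradient has entries $(\nabla f(P))_i = -\left( 2 (\ln 2) (N_i + P_i) \right)^{-1}$.

For the constraints, it is straightforward to see that $\nabla g(P) = \mathbf{1}$ and $\nabla h_i(P) = -e_i$, $i = 1, \ldots, n$.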


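We begin by assuming that the solution $P^*$ is a regular point. Then the KKT conditions give that for some $\lambda$ and nonnegative $\mu_1, \ldots, \mu_n$ we must have
Equation:
$$\begin{aligned}
\sum_{i=1}^n \mu_i P_i^* &= 0, \\
-\frac{1}{2 (\ln 2) (N_i + P_i^*)} + \lambda - \mu_i &= 0, \quad i = 1, \ldots, n.
\end{aligned}$$

The second set of constraints can be written as
Equation:
$$\begin{aligned}
\frac{1}{2 (\ln 2) (N_i + P_i^*)} &= \lambda - \mu_i, \\
N_i + P_i^* &= \frac{1}{2 (\ln 2) (\lambda - \mu_i)}.
\end{aligned}$$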


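Consider each inequality constraint $h_i$.

• If $h_i$ is inactive, then $P_i^* > 0$ and $\mu_i = 0$. Then,

Equation:
$$\begin{aligned}
N_i + P_i^* &= \frac{1}{2 \lambda (\ln 2)}, \\
P_i^* &= \frac{1}{2 \lambda (\ln 2)} - N_i > 0.
\end{aligned}$$

• If $h_i$ is active, then $P_i^* = 0$ and so
Equation:
$$\begin{aligned}
N_i &= \frac{1}{2 (\ln 2) (\lambda - \mu_i)}, \\
\mu_i &= \lambda - \frac{1}{2 N_i (\ln 2)} \ge 0, \\
\lambda &\ge \frac{1}{2 N_i (\ln 2)}, \\
\frac{1}{2 \lambda (\ln 2)} &\le N_i.
\end{aligned}$$

To simplify, write $r = \frac{1}{2 \lambda (\ln 2)}$; then, we have two possibilities for each channel $i$ from above:

• If $r - N_i > 0$ (i.e., if $N_i < r$), then $P_i^* = r - N_i$.
• If $r - N_i \le 0$ (i.e., if $r \le N_i$), then $P_i^* = 0$.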


Thus the power is allocated among the channels using the formula $P_i^* = \max(0, r - N_i)$, and the value of $r$ is chosen so that the total power constraint is met:

Equation:
$$\sum_{i=1}^n \max(0, r - N_i) = P_T.$$
i=1 


This is the famous water-filling solution to the multiple channel capacity problem, 
illustrated in [link]. 


Waterfilling solution to the multiple channel power allocation problem, which is solved using Karush-Kuhn-Tucker conditions (figure axes: Channel, Power).
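The water level $r$ in the formula $P_i^* = \max(0, r - N_i)$ has no closed form in general, but the total allocated power is nondecreasing in $r$, so it can be found by bisection. The sketch below is one possible implementation under assumptions made here: the function name `water_filling`, the bracketing interval, the iteration count, and the sample noise variances are all illustrative choices.

```python
import numpy as np

def water_filling(noise, total_power, iters=100):
    """Find r with sum(max(0, r - N_i)) = P_T by bisection; return (P, r)."""
    noise = np.asarray(noise, dtype=float)
    # At r = min(N) no power is allocated; at r = max(N) + P_T too much is.
    lo, hi = noise.min(), noise.max() + total_power
    for _ in range(iters):
        r = 0.5 * (lo + hi)
        if np.maximum(0.0, r - noise).sum() > total_power:
            hi = r
        else:
            lo = r
    r = 0.5 * (lo + hi)
    return np.maximum(0.0, r - noise), r

# Hypothetical noise variances N_i and total power P_T.
noise = [1.0, 2.0, 5.0]
powers, level = water_filling(noise, total_power=3.0)

capacity = 0.5 * np.sum(np.log2(1.0 + powers / np.asarray(noise)))
print("water level r =", level)
print("power allocation:", powers)
print("total capacity (bits):", capacity)
```

In this sample the noisiest channel sits above the water level and receives no power, which is the behavior predicted by the active-constraint case of the derivation.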


Fenchel Duality 
Introduces the concepts of an epigraph and a conjugate function necessary to set 
up Fenchel dual problems. 


Convexity and Conjugate Functions 
We begin by reviewing two notions of convexity: for sets and for functions. 


Definition 1 A subset $C \subseteq X$ is called convex if $z = \alpha x + (1 - \alpha) y \in C$ for every $x, y \in C$ and $\alpha \in [0, 1]$; $z$ is called a convex combination of $x$ and $y$.

Definition 2 Let $C$ be a convex set. A functional $f : C \to \mathbb{R}$ is convex on $C$ if $f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$ for all $x, y \in C$ and $\alpha \in [0, 1]$. If strict inequality holds for $x \neq y$ and $\alpha \in (0, 1)$, the functional is said to be strictly convex. A functional $g$ is called concave (strictly concave) if $-g$ is convex (strictly convex).

We will denote the region above the function $f$ defined over a convex set $C$ as $[f, C]$, sometimes called an epigraph, as illustrated in [link].


Example of an epigraph. 


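Definition 3 Let $f$ be a convex functional on a convex set $C$. The conjugate set $C^*$ is defined as
Equation:
$$C^* = \left\{ x^* \in X^* : \sup_{x \in C} \left[ \langle x, x^* \rangle - f(x) \right] < \infty \right\},$$
and the conjugate functional $f^* : C^* \to \mathbb{R}$ is defined as

Equation:
$$f^*(x^*) = \sup_{x \in C} \left[ \langle x, x^* \rangle - f(x) \right].$$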


There is a geometric intuition behind the definition of the conjugate functional. Consider the illustration below where the horizontal axis represents the space $X$ and the vertical axis represents the scalar field. A hyperplane in this space contains all points $(x, r) \in X \times \mathbb{R}$ for which $r = \langle x, x^* \rangle - k$ for some value of $k \in \mathbb{R}$; the vector $x^*$ determines the orientation of the hyperplane and the value $k$ determines the shift from the origin (i.e., the intercept on the axis $\mathbb{R}$). The value of the functional $f^*(x^*)$ corresponds to the supremum value of $k$ for which the hyperplane intersects $[f, C]$, and is finite only for $x^* \in C^*$; this is illustrated in [link].

The conjugate function of a convex epigraph (the hyperplane shown is $r = \langle x, x^* \rangle - k$).

Note that $C^*$ is convex and $f^*$ is convex. This definition is easily extended to concave functionals.


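Definition 4 Let $g$ be a concave functional on a convex set $D$. The conjugate set $D^*$ is defined as
Equation:
$$D^* = \left\{ x^* \in X^* : \inf_{x \in D} \left[ \langle x, x^* \rangle - g(x) \right] > -\infty \right\},$$
and the conjugate functional $g^* : D^* \to \mathbb{R}$ is defined as

Equation:
$$g^*(x^*) = \inf_{x \in D} \left[ \langle x, x^* \rangle - g(x) \right].$$

Note that $D^*$ is convex and $g^*$ is concave.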
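To make the definition of the conjugate functional concrete, the following sketch approximates $f^*(x^*) = \sup_{x \in C} [\langle x, x^* \rangle - f(x)]$ on a grid for a one-dimensional example. The choice $f(x) = \frac{1}{2} x^2$, whose conjugate is known to be $f^*(y) = \frac{1}{2} y^2$, and the grid limits are assumptions made here for illustration; note that restricting $x$ to a bounded grid makes the supremum finite for every slope, so the result is only meaningful for slopes whose maximizer lies inside the grid.

```python
import numpy as np

def conjugate(f, x_grid, slopes):
    """Approximate f*(y) = sup_x [x*y - f(x)] by maximizing over a grid of x."""
    # Row k holds x*y_k - f(x) for every grid point x.
    return np.max(np.outer(slopes, x_grid) - f(x_grid), axis=1)

def f(x):
    return 0.5 * x ** 2   # assumed convex example function

x_grid = np.linspace(-10.0, 10.0, 20001)
slopes = np.linspace(-2.0, 2.0, 9)

f_star = conjugate(f, x_grid, slopes)
print("slopes:          ", slopes)
print("numerical f*:    ", f_star)
print("closed form y^2/2:", 0.5 * slopes ** 2)
```

The numerical values should also be convex in the slope, in agreement with the remark that $f^*$ is convex.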


Fenchel Duality 


The following theorem will allow us to convert an optimization problem with a 
convex objective function into a dual problem with a concave objective function. 


Theorem 1 (Fenchel) Assume that $f$ and $g$ are convex and concave functionals, respectively, on convex sets $C$ and $D$ in a normed space $X$. Assume that $C \cap D$ contains points in the relative interior of $C$ and $D$ and that either $[f, C]$ or $[g, D]$ has a nonempty interior. Suppose further that

Equation:
$$\mu = \inf_{x \in C \cap D} \{ f(x) - g(x) \}$$

is finite. Then,
Equation:
$$\mu = \inf_{x \in C \cap D} \{ f(x) - g(x) \} = \max_{x^* \in C^* \cap D^*} \{ g^*(x^*) - f^*(x^*) \},$$

where the maximum is achieved by some $x_0^* \in C^* \cap D^*$.


In this theorem, $g(x)$ is usually set to zero. From a geometrical point of view, the theorem states that there are two ways to interpret the minimum distance between the two epigraphs $[f, C]$ and $[g, D]$ shown below: one in terms of the original functions $f$, $g$ and one in terms of the duals $f^*$, $g^*$: we look for the two tangent hyperplanes for $f$ and $g$ that are maximally separated from one another, as illustrated in [link].


Illustration of the Fenchel dual problem on a 
conjugate function. 


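If the infimum on the left is achieved by some $x_0 \in C \cap D$, then
Equation:
$$\max_{x \in C} \left[ \langle x, x_0^* \rangle - f(x) \right] = \langle x_0, x_0^* \rangle - f(x_0), \qquad \min_{x \in D} \left[ \langle x, x_0^* \rangle - g(x) \right] = \langle x_0, x_0^* \rangle - g(x_0).$$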


Example 1 (Allocation) Assume that we have a capital $x_0$ available for investment with $n$ different funds. There is a predicted gain $g_i(x_i)$ to having stock worth $x_i$ at fund $i$, where the functions $g_i$ are concave. We aim to find the optimal allocation of the capital $x = (x_1, \ldots, x_n)$ that maximizes the total gain

$$g(x) = \sum_{i=1}^n g_i(x_i).$$

To appeal to duality, we have the concave function $g(x)$ and must define a convex function, e.g., $f(x) = 0$. The constraint set can be written as the intersection $C \cap D$ with

Equation:
$$\begin{aligned}
C &= \left\{ x \in X : \sum_{i=1}^n x_i = x_0 \right\} = \{ x \in X : \langle x, \mathbf{1} \rangle = x_0 \}, \\
D &= \{ x \in X : x_i \ge 0, \; i = 1, \ldots, n \}.
\end{aligned}$$

Therefore, we can write our optimization problem as
Equation:
$$\min_{x \in C \cap D} \{ f(x) - g(x) \}.$$

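We consider the conjugate sets. First, we have
Equation:
$$\begin{aligned}
C^* &= \left\{ x^* \in X : \sup_{x \in C} \left[ \langle x, x^* \rangle - f(x) \right] < \infty \right\}, \\
&= \left\{ x^* \in X : \sup_{x \in C} \langle x, x^* \rangle < \infty \right\}.
\end{aligned}$$

We want to define $C^*$ more explicitly. Let $x^* \in C^*$ be written as $x^* = \lambda \mathbf{1} + w$, where $w \perp \mathbf{1}$. Then, for any scalar $a$,
Equation:
$$\langle a w, x^* \rangle = a \lambda \langle w, \mathbf{1} \rangle + a \langle w, w \rangle = a \| w \|^2,$$

which can be arbitrarily large when $w \neq 0$. Now let $x = \frac{x_0}{n} \mathbf{1} + a w$; it is easy to check that $x \in C$ for all $a \in \mathbb{R}$. If $w \neq 0$ then $\langle x, x^* \rangle = \lambda x_0 + a \| w \|^2$, which again can be arbitrarily large. Since for $x^* \in C^*$ it must hold that $\sup_{x \in C} \langle x, x^* \rangle < \infty$, we must have that $w = 0$ and so $x^* \in \mathrm{span}(\{ \mathbf{1} \})$. Since $x^* \in C^*$ was arbitrary, $C^* \subseteq \mathrm{span}(\{ \mathbf{1} \})$.

It is also easy to see that $\mathrm{span}(\{ \mathbf{1} \}) \subseteq C^*$, therefore implying that $C^* = \mathrm{span}(\{ \mathbf{1} \})$.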


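For $D$, we have a conjugate set
Equation:
$$\begin{aligned}
D^* &= \left\{ y : \inf_{x \in D} \left[ \langle x, y \rangle - g(x) \right] > -\infty \right\}, \\
&= \left\{ y : \inf_{x \in D} \langle x, y \rangle > -\infty \right\},
\end{aligned}$$

since $g(x) < x_0$. Now $D \subseteq D^*$ since all vectors in $D$ have nonnegative entries. Fix $y \in D^*$; if $y \notin D$ then there is some negative entry $y_i < 0$ among $i = 1, \ldots, n$. For such $i$ let $x_\lambda = \lambda e_i \in D$ for some $\lambda > 0$; then we get $\langle x_\lambda, y \rangle = \lambda y_i$, which can be arbitrarily close to $-\infty$ (i.e., as $\lambda \to \infty$ we have $\langle x_\lambda, y \rangle \to -\infty$). Thus $y \notin D^*$, a contradiction. Therefore, if $y \in D^*$ then $y \in D$ and so $D^* \subseteq D$. We have therefore shown that $D^* = D$.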


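The conjugate functionals can be written as
Equation:
$$f^*(x^*) = \sup_{x \in C} \langle x, x^* \rangle = \lambda x_0,$$

since each $x^* \in C^*$ can be written as $x^* = \lambda \mathbf{1}$. Therefore, $f^*$ can be written as a function of a single variable. Similarly,
Equation:
$$g^*(x^*) = \inf_{x \in D} \left( \langle x, x^* \rangle - g(x) \right) = \inf_{x \in D} \left( \sum_{i=1}^n \left( x_i x_i^* - g_i(x_i) \right) \right) = \sum_{i=1}^n g_i^*(x_i^*),$$

where we write
Equation:
$$g_i^*(x_i^*) = \inf_{x_i \ge 0} \left( x_i x_i^* - g_i(x_i) \right).$$

For $x^* \in C^* \cap D^*$ we can write $x_i^* = \lambda \ge 0$ for all $i = 1, \ldots, n$ and so
Equation:
$$g_i^*(x_i^*) = g_i^*(\lambda) = \inf_{x_i \ge 0} \left( \lambda x_i - g_i(x_i) \right).$$

Therefore, the original problem can be reformulated as the following single-variable problem:
Equation:
$$\lambda^* = \arg \min_{\lambda \ge 0} \left\{ \lambda x_0 - \sum_{i=1}^n g_i^*(\lambda) \right\}.$$

This is due to $x^* = \lambda \mathbf{1} \in C^* \cap D^*$ if and only if $\lambda \ge 0$. Once $\lambda^*$ is found, we can find each $x_i$ as the minimizer in $g_i^*(\lambda^*)$, cf. [link].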
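The single-variable dual problem can be sketched for a concrete family of gain functions. Everything specific below is an assumption made here for illustration: the gains $g_i(x_i) = 2 c_i \sqrt{x_i}$ with hypothetical coefficients $c_i$, the capital $x_0 = 10$, and the use of a ternary search. For this choice, $g_i^*(\lambda) = \inf_{x_i \ge 0} (\lambda x_i - 2 c_i \sqrt{x_i}) = -c_i^2 / \lambda$, attained at $x_i = (c_i / \lambda)^2$, so the dual objective $\lambda x_0 + \sum_i c_i^2 / \lambda$ is convex in $\lambda > 0$.

```python
import numpy as np

c = np.array([1.0, 2.0, 3.0])   # hypothetical gain coefficients
x0 = 10.0                       # hypothetical available capital

def g_conj(lam):
    """Conjugates g_i^*(lam) = -c_i^2 / lam for g_i(x) = 2*c_i*sqrt(x)."""
    return -c ** 2 / lam

def dual_objective(lam):
    """lam * x0 - sum_i g_i^*(lam), to be minimized over lam > 0."""
    return lam * x0 - g_conj(lam).sum()

# Ternary search, valid because the dual objective is convex in lam.
lo, hi = 1e-6, 1e3
for _ in range(200):
    m1 = lo + (hi - lo) / 3.0
    m2 = hi - (hi - lo) / 3.0
    if dual_objective(m1) < dual_objective(m2):
        hi = m2
    else:
        lo = m1
lam_star = 0.5 * (lo + hi)

# Recover the allocation as the minimizer inside each g_i^*(lam_star).
x_star = (c / lam_star) ** 2
total_gain = np.sum(2.0 * c * np.sqrt(x_star))

print("lambda* =", lam_star)
print("allocation:", x_star, " sum =", x_star.sum())
print("total gain:", total_gain, " dual value:", dual_objective(lam_star))
```

For this family the recovered allocation spends exactly the available capital, and the maximal total gain matches the minimal dual value, as Fenchel duality predicts.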


